Abstract



This thesis concerns sequential-access data compression, i.e., compression by algorithms that
read the input one or more times from beginning to end. In one chapter we 
consider adaptive prefix coding, for which we must read the input character by 
character, outputting each character's self-delimiting codeword before reading the 
next one. We show how to encode and decode each character in constant worst-case
time while producing an encoding whose length is worst-case optimal. In
another chapter we consider one-pass compression with memory bounded in terms 
of the alphabet size and context length, and prove a nearly tight tradeoff between 
the amount of memory we can use and the quality of the compression we can 
achieve. In a third chapter we consider compression in the read/write streams 
model, which allows us passes and memory both polylogarithmic in the size of the 
input. We first show how to achieve universal compression using only one pass 
over one stream. We then show that one stream is not sufficient for achieving good 
grammar-based compression. Finally, we show that two streams are necessary and 
sufficient for achieving entropy-only bounds. 
Chapter 1
Introduction 



Sequential-access data compression is by no means a new subject, but it remains 
interesting both for its own sake and for the insight it provides into other problems. 
Apart from the Data Compression Conference, several conferences often have 
tracks for compression (e.g., the International Symposium on Information Theory, 
the Symposium on Combinatorial Pattern Matching and the Symposium on String 
Processing and Information Retrieval), and papers on compression often appear at 
conferences on algorithms in general (e.g., the Symposium on Discrete Algorithms 
and the European Symposium on Algorithms) or even theory in general (e.g., 
the Symposium on Foundations of Computer Science, the Symposium on Theory 
of Computing and the International Colloquium on Algorithms, Languages and 
Programming). We mention these conferences in particular because, in this thesis,
we concentrate on the theoretical aspects of data compression, leaving practical 
considerations for later. 

Apart from its direct applications, work on compression has inspired the de- 
sign and analysis of algorithms and data structures, e.g., succinct or compact 
data structures such as indexes. Work on sequential data compression in partic- 
ular has inspired the design and analysis of online algorithms and dynamic data 
structures, e.g., prediction algorithms for paging, web caching and computational 
finance. Giancarlo, Scaturro and Utro [GSU] recently described how, in
bioinformatics, compression algorithms are important not only for storage, but also for
indexing, speeding up some dynamic programs, entropy estimation, segmentation, 
and pattern discovery. 

In this thesis we study three kinds of sequential-access data compression:
adaptive prefix coding, one-pass compression with memory bounded in terms of the
alphabet size and context length, and compression in the read/write streams model.
Adaptive prefix coding is perhaps the most natural form of online compression, 
and adaptive prefix coders are the oldest and simplest kind of sequential
compressors, having been studied for more than thirty years. Nevertheless, in Chapter 2 we
present the first one that is worst-case optimal with respect to both the length of
the encoding it produces and the time it takes to encode and decode. In Chapter 3
we observe that adaptive alphabetic prefix coding is equivalent to online sorting
with binary comparisons, so our algorithm from Chapter 2 can easily be turned
into an algorithm for online sorting. Chapter 4 is also about online sorting but,
instead of aiming to minimize the number of comparisons (which remains within 
a constant factor of optimal), we concentrate on trying to use sublinear memory, 
in line with research on streaming algorithms. We then study compression with 
memory constraints because, although compression is most important when space 
is in short supply and compression algorithms are often implemented in limited 
memory, most analyses ignore memory constraints as an implementation detail, 
creating a gap between theory and practice. We first study compression in the 
case when we can make only one pass over the data. One-pass compressors that 
use memory bounded in terms of the alphabet size and context length can be 
viewed as finite-state machines, and in Chapter 5 we use that property to prove a
nearly tight tradeoff between the amount of memory we can use and the quality 
of the compression we can achieve. We then study compression in the read/write 
streams model, which allows us to make multiple passes over the data, change 
them, and even use multiple streams (see [Sch07]). Streaming algorithms have
revolutionized the processing of massive data sets, and the read/write streams 
model is an elegant conversion of the streaming model into a model of external 
memory. By viewing read/write stream algorithms as simply more powerful
automata, which can use passes and memory both polylogarithmic in the size of the
input, in Chapter 6 we prove lower bounds on the compression we can achieve with
only one stream. Specifically, we show that, although we can achieve universal 
compression with only one pass over one stream, we need at least two streams to 
achieve good grammar-based compression or entropy-only bounds. We also com- 
bine previously known results to prove that two streams are sufficient for us to 
compute the Burrows-Wheeler Transform and, thus, achieve low-entropy bounds.
As corollaries of our lower bounds for compression, we obtain lower bounds for 
computing strings' minimum periods and for computing the Burrows-Wheeler
Transform [BW94], which came as something of a surprise to us. It seems no
one has previously considered the problem of finding strings' minimum periods in 
a streaming model, even though some related problems have been studied (see, 
e.g., [EM^04], in which the authors loosen the definition of a repeated substring
to allow approximate matches). We are currently investigating whether we can 
derive any more such results. 

The chapters in this thesis were written separately and can be read separately; 
in fact, it might be better to read them with at least a small pause between chap- 
ters, so the variations in the models considered do not become confusing. Of 
course, to make each chapter independent, we have had to introduce some degree 
of redundancy. Chapter 2 was written specifically for this thesis, and is based on
recent joint work [GN] with Yakov Nekrich at the University of Bonn; a summary
was presented at the University of Bielefeld in October of 2008, and will be
presented at the annual meeting of the Italy-Israel FIRB project "Pattern Discovery
in Discrete Structures, with Applications to Bioinformatics" at the University of
Palermo in February of 2009. Chapter 3 was also written specifically for this
thesis, but Chapter 4 was presented at the 10th Italian Conference on Theoretical
Computer Science [Gag07b] and then published in Information Processing
Letters [Gag08]. Chapter 5 is a slight modification of part of a paper [GM07b]
written with Giovanni Manzini at the University of Eastern Piedmont, which
was presented in 2007 at the 32nd Symposium on Mathematical Foundations of
Computer Science. A paper [GKN09] we wrote with Marek Karpinski (also at the
University of Bonn) and Yakov Nekrich, which partially combines the results in these
two chapters, will appear at the 2009 Data Compression Conference. Chapter 6
is a slight modification of a paper [Gagb] that has been submitted to a
conference, with a very brief summary of some material from a paper [GM07a] written
with Giovanni Manzini and presented at the 18th Symposium on Combinatorial
Pattern Matching.
Chapter 2

Adaptive Prefix Coding 



Prefix codes are sometimes called instantaneous codes because, since no codeword 
is a prefix of another, the decoder can output each character once it reaches 
the end of its codeword. Adaptive prefix coding could thus be called "doubly
instantaneous", because the encoder must produce a self-delimiting codeword for
each character before reading the next one. The main idea behind adaptive prefix 
coding is simple: both the encoder and the decoder start with the same prefix 
code; the encoder reads the first character of the input and writes its codeword; 
the decoder reads that codeword and decodes the first character; the encoder and 
decoder now have the same information, and they update their copies of the code 
in the same way; then they recurse on the remainder of the input. The two main 
challenges are, first, to efficiently update the two copies of the code and, second, 
to prove the total length of the encoding is not much more than what it would be 
if we were to use an optimal static prefix coder. 
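
To make the synchronization concrete, here is a minimal Python sketch of the scheme just described. It is only an illustration, not the algorithm of this chapter: the code it maintains is a deliberately naive one (the unary codeword for a character's rank by frequency), and the names `ranked`, `encode` and `decode` are our own assumptions. What it is meant to show is that the decoder can mirror every update, because it has seen exactly the characters the encoder has already encoded.

```python
def ranked(counts):
    # Characters by decreasing count, ties broken alphabetically.
    # Encoder and decoder compute the same order from the same counts.
    return sorted(counts, key=lambda c: (-counts[c], c))


def encode(s, alphabet):
    counts = {c: 0 for c in alphabet}
    out = []
    for ch in s:
        r = ranked(counts).index(ch)
        out.append("1" * r + "0")  # unary codeword: self-delimiting
        counts[ch] += 1            # update only after writing the codeword
    return "".join(out)


def decode(bits, alphabet):
    counts = {c: 0 for c in alphabet}
    out, r = [], 0
    for b in bits:
        if b == "1":
            r += 1                 # still inside the current codeword
        else:
            ch = ranked(counts)[r]  # codeword complete: look up the rank
            out.append(ch)
            counts[ch] += 1        # same update the encoder made
            r = 0
    return "".join(out)
```

For instance, `decode(encode("abracadabra", "abcdr"), "abcdr")` returns the original string, although neither side ever transmits the code itself.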

Because Huffman codes [Huf52] have minimum expected codeword length,
early work on adaptive prefix coding naturally focused on efficiently maintaining
a Huffman code for the prefix of the input already encoded. Faller [Fal73],
Gallager [Gal78] and Knuth [Knu85] developed an adaptive Huffman coding
algorithm (usually known as the FGK algorithm, for their initials) and showed it
takes time proportional to the length of the encoding it produces. Vitter [Vit87]
gave an improved version of their algorithm and showed it uses less than one more
bit per character than we would use with static Huffman coding. (For simplicity
we consider only binary encodings; therefore, by $\log$ we always mean $\log_2$.)
Milidiú, Laber and Pessoa [MLP99] later extended Vitter's techniques to analyze
the FGK algorithm, and showed it uses less than two more bits per character
than we would use with static Huffman coding. Suppose the input is a string
$s$ of $n$ characters drawn from an alphabet of size $\sigma \ll n$; let $H$ be the
empirical entropy of $s$ (i.e., the entropy of the normalized distribution of characters
in $s$) and let $r$ be the redundancy of a Huffman code for $s$ (i.e., the difference
between the expected codeword length and $H$). The FGK algorithm encodes $s$
as at most $(H + 2 + r)n + o(n)$ bits and Vitter's algorithm encodes it as at most
$(H + 1 + r)n + o(n)$ bits; both take $O((H + 1)n)$ total time to encode and
decode, or $O(H + 1)$ amortized time per character. Table 2.1 summarizes bounds
for various adaptive prefix coders.

If $s$ is drawn from a memoryless source then, as $n$ grows, adaptive Huffman
coding will almost certainly "lock on" to a Huffman code for the source and, thus,
use $(H + r)n + o(n)$ bits. In this case, however, the whole problem is easy: we
can achieve the same bound, and use less time, by periodically building a new
Huffman code. If $s$ is chosen adversarially, then every algorithm uses at least
$(H + 1 + r)n - o(n)$ bits in the worst case. To see why, fix an algorithm and
suppose $\sigma = n^{1/2} = 2^\ell + 1$ for some integer $\ell$, so any binary tree on $\sigma$ leaves has at
least two leaves with depths at least $\ell + 1$. Any prefix code for $\sigma$ characters can
be represented as a code-tree on $\sigma$ leaves; the length of the lexicographically $i$th
codeword is the depth of the $i$th leaf from the left. It follows that an adversary
can choose $s$ such that the algorithm encodes it as at least $(\ell + 1)n$ bits. On the
other hand, a static prefix coding algorithm can assign codewords of length $\ell$ to
the $\sigma - 2$ most frequent characters and codewords of length $\ell + 1$ to the two least
frequent characters, and thus use at most $\ell n + 2n/\sigma + O(\sigma \log \sigma) = \ell n + o(n)$ bits.
Therefore, since the expected codeword length of a Huffman code is minimum,
$(H + r)n \leq \ell n + o(n)$ and so $(\ell + 1)n \geq (H + 1 + r)n - o(n)$.

This lower bound seems to say that Vitter's upper bound cannot be significantly
improved. However, to force the algorithm to use $(H + 1 + r)n - o(n)$ bits,
it might be that the adversary must choose $s$ such that $r$ is small. In a previous
paper [Gag07a] we were able to show this is the case, by giving an adaptive prefix
coder that encodes $s$ in at most $(H + 1)n + o(n)$ bits. This bound is perhaps a little
surprising, since our algorithm was based on Shannon codes [Sha48], which generally
do not have minimum expected codeword length. Like the FGK algorithm and
Vitter's algorithm, our algorithm used $O(H + 1)$ amortized time per character to
encode and decode. Recently, Karpinski and Nekrich [KN] showed how to combine
some of our results with properties of canonical codes, defined by Schwartz and
Kallick [SK64] (see also [Kle00, TM00, TM01]), to achieve essentially the same
compression while encoding and decoding each character in $O(1)$ amortized time
and $O(\log(H + 2))$ amortized time, respectively. Nekrich [Nek07] implemented
their algorithm and observed that, in practice, it is significantly faster than
arithmetic coding and slightly faster than Turpin and Moffat's GEO coder [TM01], although
the compression it achieves is not quite as good. The rest of this chapter is based
on joint work with Nekrich [GN] that shows how, on a RAM with $\Omega(\log n)$-bit
words, we can speed up our algorithm even more, to both encode and decode in
$O(1)$ worst-case time. We note that Rueda and Oommen [RO04, RO06, RO08]
have demonstrated the practicality of certain implementations of adaptive Fano
coding, which is somewhat related to our algorithm, especially a version [Gag04]
in which we maintained an explicit code-tree.



Table 2.1: Bounds for adaptive prefix coding: the times to encode and decode 
each character and the total length of the encoding. Bounds in the first row and 
last column are worst-case; the others are amortized. 





                              Encoding      Decoding            Length

Gagie and Nekrich [GN]        $O(1)$        $O(1)$              $(H + 1)n + o(n)$
Karpinski and Nekrich [KN]    $O(1)$        $O(\log(H + 2))$    $(H + 1)n + o(n)$
Gagie [Gag07a]                $O(H + 1)$    $O(H + 1)$          $(H + 1)n + o(n)$
Vitter [Vit87]                $O(H + 1)$    $O(H + 1)$          $(H + 1 + r)n + o(n)$
Knuth [Knu85],
Gallager [Gal78],             $O(H + 1)$    $O(H + 1)$          $(H + 2 + r)n + o(n)$
Faller [Fal73]


2.1 Algorithm 

A Shannon code is one in which, if a character has probability $p$, then its
codeword has length at most $\lceil \log(1/p) \rceil$. In his seminal paper on information theory,
Shannon [Sha48] showed how to build such a code for any distribution containing
only positive probabilities.

Theorem 2.1 (Shannon, 1948). Given a probability distribution $P = p_1, \ldots, p_\sigma$
with $p_1 \geq \cdots \geq p_\sigma > 0$, we can build a prefix code in $O(\sigma)$ time whose codewords,
in lexicographic order, have lengths $\lceil \log(1/p_1) \rceil, \ldots, \lceil \log(1/p_\sigma) \rceil$.
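
As a small sanity check on the codeword lengths in Theorem 2.1, the following Python sketch computes $\lceil \log(1/p) \rceil$ for probabilities given as integer counts over their sum and verifies Kraft's inequality, which is what guarantees that a prefix code with those lengths exists. It does not reproduce the linear-time construction of the theorem. Using integer counts (to keep the arithmetic exact) and the name `shannon_lengths` are assumptions made for this illustration.

```python
from fractions import Fraction


def shannon_lengths(counts):
    """Return ceil(log2(total / k)) for each count k, using integers only."""
    total = sum(counts)
    lengths = []
    for k in counts:
        length = 0
        while (k << length) < total:  # smallest length with 2^length >= total / k
            length += 1
        lengths.append(length)
    return lengths


counts = [5, 3, 2, 1, 1]          # probabilities 5/12, 3/12, 2/12, 1/12, 1/12
lengths = shannon_lengths(counts)  # [2, 2, 3, 4, 4]
kraft = sum(Fraction(1, 2 ** length) for length in lengths)
assert kraft <= 1                  # Kraft's inequality holds (here the sum is 3/4)
```

Each length satisfies $2^{-\text{length}} \leq p$, so the Kraft sum can never exceed the sum of the probabilities, which is 1.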
We encode each character of $s$ using a canonical Shannon code for a probability
distribution that is roughly the normalized distribution of characters in the prefix
of $s$ already encoded. In order to avoid having to consider probabilities equal to
0, we start by assigning every character a count of 1. This means the smallest
probability we ever consider is at least $1/(n + \sigma)$, so the longest codeword we ever
consider is $O(\log n)$ bits.

A canonical code is one in which the first codeword is a string of 0s and,
for $1 \leq i < \sigma$, we can obtain the $(i + 1)$st codeword by incrementing the $i$th
codeword (viewed as a binary number) and appending some number of 0s to it.
For example, Figure 2.1 shows the codewords in a canonical code, together with
their lexicographic ranks. By definition, the difference between two codewords of
the same length in a canonical code, viewed as binary numbers, is the same as the
difference between their ranks. For example, the third codeword in Figure 2.1 is
0100 and the sixth codeword is 0111, and $(0111)_2 - (0100)_2 = 6 - 3$, where $(\cdot)_2$
means the argument is to be viewed as a binary number. We use this property
to build a representation of the code that lets us quickly answer encoding and
decoding queries.

 1) 000        7) 1000
 2) 001        8) 1001
 3) 0100       9) 10100
 4) 0101      10) 10101
 5) 0110       ...
 6) 0111      16) 11011

Figure 2.1: The codewords in a canonical code. 
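
The increment-and-append rule in the definition of a canonical code can be sketched in a few lines of Python. The function name `canonical_code` is hypothetical, and the list of lengths (two codewords of length 3, six of length 4 and eight of length 5) is read off Figure 2.1; the sketch assumes the lengths are nondecreasing and satisfy Kraft's inequality.

```python
def canonical_code(lengths):
    """Build canonical codewords for nondecreasing lengths satisfying Kraft."""
    words = []
    value = 0  # the first codeword is a string of 0s
    for i, length in enumerate(lengths):
        if i > 0:
            # increment the previous codeword, then append 0s up to the new length
            value = (value + 1) << (length - lengths[i - 1])
        words.append(format(value, "0{}b".format(length)))
    return words


lengths = [3] * 2 + [4] * 6 + [5] * 8
words = canonical_code(lengths)
# Ranks are 1-based in the text, so rank 3 is words[2] and rank 6 is words[5].
assert int(words[5], 2) - int(words[2], 2) == 6 - 3
```

With these lengths the sketch yields the sixteen codewords of Figure 2.1, and the final assertion is the rank-difference property used in the rest of the section.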



We maintain the following data structures: an array $A_1$ that stores the
codewords' ranks in lexicographic order by character; an array $A_2$ that stores the
characters and their frequencies in order by frequency; a dictionary $D_1$ that stores
the rank of the first codeword of each length, with the codeword itself as auxiliary
information; and a dictionary $D_2$ that stores the first codeword of each length,
with its rank as auxiliary information.

To encode a given character $a$, we first use $A_1$ to find $a$'s codeword's rank; then
use $D_1$ to find the rank of the first codeword of the same length and that codeword
as auxiliary information; then add the difference in ranks to that codeword, viewed
as a binary number, to obtain $a$'s codeword. For example, if the codewords are as
shown in Figure 2.1 and $a$ is the $j$th character in the alphabet and has codeword
0111, then

1. $A_1[j] = 6$,

2. $D_1.\mathrm{pred}(6) = (3, 0100)$,

3. $(0100)_2 + 6 - 3 = (0111)_2$.

To decode $a$, given a binary string prefixed with its codeword, we first search
in $D_2$ for the predecessor of the first $\lceil \log(n + \sigma) \rceil$ bits of the binary string, to
find the first codeword of the same length and that codeword's rank as auxiliary
information; then add that rank to the difference in codewords, viewed as binary
numbers, to obtain $a$'s codeword's rank; then use $A_2$ to find $a$. In our example
above,

1. $D_2.\mathrm{pred}(0111\ldots) = (0100, 3)$,

2. $3 + (0111)_2 - (0100)_2 = 6$,

3. $A_2[6] = a_j$.

Admittedly, reading the first $\lceil \log(n + \sigma) \rceil$ bits of the binary string will generally
result in the decoder reading past the end of most codewords before outputting
the corresponding character. We do not see how to avoid this without potentially
having the decoder read some codewords bit by bit.
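
The two query procedures above can be sketched in Python for the code of Figure 2.1. This is a simplified illustration under several assumptions of ours: the dictionaries $D_1$ and $D_2$ are replaced by sorted lists searched with `bisect`, the decoder's window is the length of the longest codeword rather than $\lceil \log(n + \sigma) \rceil$ bits, and the sixteen-letter alphabet `chars`, in which the character of rank $i$ is `chars[i - 1]`, is hypothetical.

```python
from bisect import bisect_right

lengths = [3] * 2 + [4] * 6 + [5] * 8   # codeword lengths from Figure 2.1
chars = "abcdefghijklmnop"              # hypothetical alphabet, rank i -> chars[i - 1]

# For each codeword length: rank of its first codeword, that codeword, the length.
first_rank, first_word, word_len = [], [], []
value = 0
for i, length in enumerate(lengths):
    if i > 0:
        value = (value + 1) << (length - lengths[i - 1])
    if i == 0 or length != lengths[i - 1]:
        first_rank.append(i + 1)
        first_word.append(value)
        word_len.append(length)

W = max(lengths)                                               # decoder window
padded = [w << (W - l) for w, l in zip(first_word, word_len)]  # keys of D2
A1 = {c: i + 1 for i, c in enumerate(chars)}                   # character -> rank


def encode_char(c):
    rank = A1[c]
    k = bisect_right(first_rank, rank) - 1         # predecessor query on D1
    word = first_word[k] + rank - first_rank[k]    # add the difference in ranks
    return format(word, "0{}b".format(word_len[k]))


def decode_char(bits):
    """Decode the first codeword of bits; return (character, codeword length)."""
    window = int(bits[:W].ljust(W, "0"), 2)        # first W bits, zero-padded
    k = bisect_right(padded, window) - 1           # predecessor query on D2
    word = window >> (W - word_len[k])             # strip bits past the codeword
    rank = first_rank[k] + word - first_word[k]    # add the difference in codewords
    return chars[rank - 1], word_len[k]
```

Here `encode_char("f")` gives `"0111"`, matching the worked example for rank 6, and `decode_char("011110")` gives `("f", 4)` even though the window extends one bit past the codeword.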

Querying $A_1$ and $A_2$ takes $O(1)$ worst-case time, so the time to encode and
decode depends mainly on the time needed for predecessor queries on $D_1$ and
$D_2$. Since the longest codeword we ever consider is $O(\log n)$ bits, each dictionary
contains $O(\log n)$ keys, so we can implement each as an instance of the data
structure described below, due to Fredman and Willard [FW93]. This way, apart
from the time to update the dictionaries, we encode and decode each character in a
total of $O(1)$ worst-case time. Andersson, Miltersen and Thorup [ABT99] showed
Fredman and Willard's data structure can be implemented with $AC^0$ instructions,
and Thorup [Tho03] showed it can be implemented with $AC^0$ instructions available
on a Pentium 4; admittedly, though, in practice it might still be faster to use a
sorted array to encode and decode each character in $O(\log \log n)$ time.

Lemma 2.2 (Fredman & Willard, 1993). Given $O(\log^{1/6} n)$ keys, we can build
a dictionary in $O(\log^{2/3} n)$ worst-case time that stores those keys and supports
predecessor queries in $O(1)$ worst-case time.

Corollary 2.3. Given $O(\log n)$ keys, we can build a dictionary in $O(\log^{3/2} n)$
worst-case time that stores those keys and supports predecessor queries in $O(1)$
worst-case time.

Proof. We store the keys at the leaves of a search tree with degree $O(\log^{1/6} n)$,
size $O(\log^{5/6} n)$ and height at most 5. Each node stores an instance of
Fredman and Willard's dictionary from Lemma 2.2: each dictionary at a leaf stores
$O(\log^{1/6} n)$ keys and each dictionary at an internal node stores the first key in
each of its children's dictionaries. It is straightforward to build the search tree in
$O(\log^{5/6 + 2/3} n) = O(\log^{3/2} n)$ time and implement queries in $O(1)$ time. □



Since a codeword's lexicographic rank is the same as the corresponding
character's rank by frequency, and a character's frequency is an integer that changes
only by being incremented after each of its occurrences, we can use a data
structure due to Gallager [Gal78] to update $A_1$ and $A_2$ in $O(1)$ worst-case time per
character of $s$. We can use $O(\log n)$ binary searches in $A_2$ and $O(\log^2 n)$ time
to compute the number of codewords of each length; building $D_1$ and $D_2$ then
takes $O(\log^{3/2} n)$ time. Using multiple copies of each data structure and
standard background-processing techniques, we update each set of copies after every
$\lfloor \log^2 n \rfloor$ characters and stagger the updates, such that we need spend only $O(1)$
worst-case time per character and, for $1 \leq i \leq n$, the copies we use to encode
the $i$th character of $s$ will always have been last updated after we encoded the
$(i - \lfloor \log^2 n \rfloor)$th character.

Writing $s[i]$ for the $i$th character of $s$, $s[1..i]$ for the prefix of $s$ of length $i$, and
$\mathrm{occ}(s[i], s[1..i])$ for the number of times $s[i]$ occurs in $s[1..i]$, we can summarize
the results of this section as the following lemma.



Lemma 2.4. For $1 \leq i \leq n$, we can encode $s[i]$ as at most

\[ \left\lceil \log \frac{i + \sigma}{\max\left(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^2 n \rfloor, 1\right)} \right\rceil \]

bits such that encoding and decoding it take $O(1)$ worst-case time.
2.2 Analysis

Analyzing the length of the encoding our algorithm produces is just a matter 
of bounding the sum of the codewords' lengths. Fortunately, we can do this 
using a modification of the proof that adaptive arithmetic coding produces an 
encoding not much longer than the one decrementing arithmetic coding produces 
(see, e.g., [HV92]). 



Lemma 2.5.

\[ \sum_{i=1}^{n} \left\lceil \log \frac{i + \sigma}{\max\left(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^2 n \rfloor, 1\right)} \right\rceil \leq (H + 1)n + O(\sigma \log^3 n) . \]



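Proof. Since $\{\mathrm{occ}(s[i], s[1..i]) : 1 \leq i \leq n\}$ and $\{j : 1 \leq j \leq \mathrm{occ}(a, s),\ a \text{ a character}\}$ are the
same multiset,

\begin{align*}
\sum_{i=1}^{n} \left\lceil \log \frac{i + \sigma}{\max\left(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^2 n \rfloor, 1\right)} \right\rceil
 &< \sum_{i=1}^{n} \log(i + \sigma) - \sum_{i=1}^{n} \log \max\left(\mathrm{occ}(s[i], s[1..i]) - \lfloor \log^2 n \rfloor, 1\right) + n \\
 &= \sum_{i=1}^{n} \log(i + \sigma) - \sum_{a} \sum_{j=1}^{\mathrm{occ}(a, s) - \lfloor \log^2 n \rfloor} \log j + n \\
 &< \sum_{i=1}^{n} \log i + \sigma \log(n + \sigma) - \sum_{a} \sum_{j=1}^{\mathrm{occ}(a, s)} \log j + \sigma \log^3 n + n \\
 &= \log(n!) - \sum_{a} \log(\mathrm{occ}(a, s)!) + n + O(\sigma \log^3 n) .
\end{align*}

Since

\[ \log(n!) - \sum_{a} \log(\mathrm{occ}(a, s)!) = \log \frac{n!}{\prod_{a} \mathrm{occ}(a, s)!} \]

is the logarithm of the number of distinct arrangements of the characters in $s$, we could complete
the proof by information-theoretic arguments; however, we will use straightforward
calculation. Specifically, Robbins' extension [Rob55] of Stirling's Formula,

\[ \sqrt{2\pi}\, x^{x + 1/2} e^{-x + 1/(12x + 1)} < x! < \sqrt{2\pi}\, x^{x + 1/2} e^{-x + 1/(12x)} , \]

implies that

\[ x \log x - x \log e < \log(x!) \leq x \log x - x \log e + O(\log x) , \]

where $e$ is the base of the natural logarithm. Therefore, since $\sum_{a} \mathrm{occ}(a, s) = n$,

\begin{align*}
\log(n!) - \sum_{a} \log(\mathrm{occ}(a, s)!) + n + O(\sigma \log^3 n)
 &\leq n \log n - \sum_{a} \mathrm{occ}(a, s) \log \mathrm{occ}(a, s) - n \log e + \sum_{a} \mathrm{occ}(a, s) \log e + n + O(\sigma \log^3 n) \\
 &= \sum_{a} \mathrm{occ}(a, s) \log \frac{n}{\mathrm{occ}(a, s)} + n + O(\sigma \log^3 n) \\
 &= (H + 1)n + O(\sigma \log^3 n) .
\end{align*}

□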



Combining Lemmas 2.4 and 2.5 and assuming $\sigma = o(n / \log^3 n)$ immediately
gives us our result for this chapter.

Theorem 2.6. We can encode $s$ as at most $(H + 1)n + o(n)$ bits such that encoding
and decoding each character takes $O(1)$ worst-case time.
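
The length bound can be checked numerically on a simplified variant of the coder. The Python sketch below is an illustration under our own assumptions, not the algorithm behind Theorem 2.6: it recomputes the Shannon codeword length from the current counts before every character, so it omits the $\lfloor \log^2 n \rfloor$ delay in updates, and it compares the total against $(H + 1)n + \sigma \log(n + \sigma)$, the bound that the calculation in Lemma 2.5 gives when there is no delay. The function names and the test string are hypothetical.

```python
import math
from collections import Counter


def adaptive_shannon_bits(s, alphabet):
    """Total codeword length when each character costs ceil(log2(total / count))."""
    counts = {c: 1 for c in alphabet}  # every character starts with count 1
    total = len(alphabet)
    bits = 0
    for ch in s:
        length = 0
        while (counts[ch] << length) < total:  # integer ceil(log2(total / count))
            length += 1
        bits += length
        counts[ch] += 1
        total += 1
    return bits


def empirical_entropy(s):
    n = len(s)
    return sum(k / n * math.log2(n / k) for k in Counter(s).values())


s = "abracadabra" * 50
alphabet = "abcdr"
n, sigma = len(s), len(alphabet)
bits = adaptive_shannon_bits(s, alphabet)
bound = (empirical_entropy(s) + 1) * n + sigma * math.log2(n + sigma)
assert bits <= bound
```

The additive term here is $\sigma \log(n + \sigma)$ rather than $O(\sigma \log^3 n)$ only because this variant updates its counts immediately.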
Chapter 3

Online Sorting with Few
Comparisons



Comparison-based sorting is perhaps the most studied problem in computer
science, but there remain basic open questions about it. For example, exactly how
many comparisons does it take to sort a multiset? Over thirty years ago, Munro
and Spira [MS76] proved distribution-sensitive upper and lower bounds that
differ by $O(n \log \log \sigma)$, where $n$ is the size of the multiset $s$ and $\sigma$ is the number
of distinct elements $s$ contains. Specifically, they proved that $nH + O(n)$ ternary
comparisons are sufficient and $nH - (n - \sigma) \log \log \sigma - O(n)$ are necessary, where
$H = \sum_{a} \frac{\mathrm{occ}(a, s)}{n} \log \frac{n}{\mathrm{occ}(a, s)}$ denotes the entropy of the distribution of elements in
$s$ and $\mathrm{occ}(a, s)$ denotes the multiplicity of the distinct element $a$ in $s$.
Throughout, by $\log$ we mean $\log_2$. Their bounds have been improved in a series of
papers, summarized in Table 3.1, so that the difference now is slightly less than
$(1 + \log e)n \approx 2.44n$, where $e$ is the base of the natural logarithm.



Apart from the bounds shown in Table 3.1, there have been many bounds 
proven about, e.g., sorting multisets in-place or with minimum data movement, 
or in the external-memory or cache-oblivious models. In this chapter we consider 
online stable sorting; online algorithms sort s element by element and keep those 
already seen in sorted order, and stable algorithms preserve the order of equal 
elements. For example, splaysort (i.e., sorting by insertion into a splay tree [ST85])
is online, stable and takes $O((H + 1)n)$ comparisons and time. In Section 3.1
we show how, if $\sigma = o(n^{1/2} / \log n)$, then we can sort $s$ online and stably using
$(H + 1)n + o(n)$ ternary comparisons and $O((H + 1)n)$ time. In Section 3.2 we
prove $(H + 1)n - o(n)$ comparisons are necessary in the worst case.
Table 3.1: Bounds for sorting a multiset using ternary comparisons.





                           Upper bound            Lower bound

Munro and Raman [MR91]                            $(H - \log e)n + O(\log n)$
Fischer [Fis84]            $(H + 1)n - \sigma$    $(H - \log H)n - O(n)$
Dobkin and Munro [DM80]                           n log (log n o{n)
Munro and Spira [MS76]     $nH + O(n)$            $nH - (n - \sigma) \log \log \sigma - O(n)$



3.1 Algorithm 

Our idea is to sort s by inserting its elements into a binary search tree T, which 
we rebuild occasionally using the following theorem by Mehlhorn [Meh77]. We
rebuild $T$ whenever the number of elements processed since the last rebuild is
equal to the number of distinct elements seen by the time of the last rebuild. This
way, we spend $O(n)$ total time rebuilding $T$.

Theorem 3.1 (Mehlhorn, 1977). Given a probability distribution $P = p_1, \ldots, p_k$
on $k$ keys, with no $p_i = 0$, in $O(k)$ time we can build a binary search tree containing
those keys at depths at most $\log(1/p_1), \ldots, \log(1/p_k)$.

To rebuild $T$ after processing $i$ elements of $s$, to each distinct element $a$ seen so
far we assign probability $\mathrm{occ}(a, s[1..i]) / i$, where $s[1..i]$ denotes the first $i$ elements
of $s$; we then apply Theorem 3.1. Notice the smallest probability we consider is
at least $1/n$, so the resulting tree has height at most $\log n$.
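
To illustrate the kind of tree a rebuild produces, the following Python sketch uses a simple midpoint-bisection rule. It is not Mehlhorn's linear-time algorithm; it is a slower construction of ours that meets the same depth bound, with the function names and the example counts chosen only for illustration. Each key owns a subinterval of $[0, 1)$ whose length is its probability, and the root of each subtree is the key whose interval contains the midpoint of the current dyadic interval. A key with probability $p$ must contain that midpoint once the dyadic interval is narrower than $2p$, so its depth is at most $\log(1/p)$.

```python
from fractions import Fraction


def build_tree(counts):
    """counts: (key, count) pairs in sorted key order, every count > 0.
    Returns a nested tuple (key, left, right), or None for an empty tree."""
    total = sum(c for _, c in counts)
    items, start = [], Fraction(0)
    for key, c in counts:
        end = start + Fraction(c, total)
        items.append((key, start, end))  # key owns the interval [start, end)
        start = end
    return _bisect(items, Fraction(0), Fraction(1))


def _bisect(items, lo, width):
    if not items:
        return None
    mid = lo + width / 2
    root = [t for t in items if t[1] <= mid < t[2]]  # at most one such key
    left = _bisect([t for t in items if t[2] <= mid], lo, width / 2)
    right = _bisect([t for t in items if t[1] > mid], mid, width / 2)
    if root:
        return (root[0][0], left, right)
    # No interval contains mid, so all remaining keys lie on one side of it.
    return left if left is not None else right


def depths(tree, d=0, out=None):
    out = {} if out is None else out
    if tree is not None:
        out[tree[0]] = d
        depths(tree[1], d + 1, out)
        depths(tree[2], d + 1, out)
    return out


def inorder(tree):
    return [] if tree is None else inorder(tree[1]) + [tree[0]] + inorder(tree[2])


counts = [("a", 5), ("b", 2), ("c", 1), ("d", 1), ("r", 2)]
tree = build_tree(counts)
total = sum(c for _, c in counts)
depth = depths(tree)
# Depth bound of Theorem 3.1: depth <= log2(total / count) for every key.
assert all((c << depth[k]) <= total for k, c in counts)
```

Exact rational arithmetic (`Fraction`) is used so that the comparisons against midpoints are not affected by rounding.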

We want T always to contain a node for each distinct element a seen so far, 
that stores a as a key and a linked list of a's occurrences so far. After we use 
Mehlhorn's theorem, therefore, we extend $T$ and then replace each of its leaves by
an empty AVL tree [AL62]. To process an element $s[i]$ of $s$, we search for $s[i]$ in
$T$; if we find a node $v$ whose key is equal to $s[i]$, then we append $s[i]$ to $v$'s linked
list; otherwise, our search ends at a node of an AVL tree, into which we insert a 
new node whose key is equal to s[i] and whose linked list contains s[i]. 

If an element equal to $s[i]$ occurs by the time of the last rebuild before we
process $s[i]$, then the corresponding node is at depth at most

\[ \log \frac{i}{\max\left(\mathrm{occ}(s[i], s[1..i]) - \sigma, 1\right)} \]

in $T$, so the number of ternary comparisons we use to insert $s[i]$ into $T$ is at most
that number plus 1; the extra comparison is necessary to check that the
algorithm should not proceed deeper into the tree. Otherwise, since our AVL trees
always contain at most $\sigma$ nodes, we use $O(\log n + \log \sigma) = O(\log n)$ comparisons.
Therefore, we use a total of at most

\[ \sum_{i=1}^{n} \log \frac{i}{\max\left(\mathrm{occ}(s[i], s[1..i]) - \sigma, 1\right)} + n + O(\sigma \log n) \]

comparisons to sort $s$ and, assuming each comparison takes $O(1)$ time, a
proportional amount of time. We can bound this sum using the following technical
lemma, which says the logarithm of the number of distinct arrangements of the
elements in $s$ is close to $nH$, and the subsequent corollary. We write $a_1, \ldots, a_\sigma$ to
denote the distinct elements in $s$.

Lemma 3.2.

\[ nH - O(\sigma \log(n / \sigma)) \leq \log \frac{n!}{\mathrm{occ}(a_1, s)! \cdots \mathrm{occ}(a_\sigma, s)!} \leq nH + O(\log n) . \]

Proof. Robbins' extension [Rob55] of Stirling's Formula,

\[ \sqrt{2\pi}\, x^{x + 1/2} e^{-x + 1/(12x + 1)} < x! < \sqrt{2\pi}\, x^{x + 1/2} e^{-x + 1/(12x)} , \]

implies that

\[ x \log x - x \log e < \log(x!) \leq x \log x - x \log e + O(\log x) . \]

Therefore, since $\sum_{a} \mathrm{occ}(a, s) = n$, straightforward calculation shows that

\[ \log \frac{n!}{\mathrm{occ}(a_1, s)! \cdots \mathrm{occ}(a_\sigma, s)!} = \log(n!) - \sum_{a} \log(\mathrm{occ}(a, s)!) \]

is at least $nH - O(\sigma \log(n / \sigma))$ and at most $nH + O(\log n)$. □
Corollary 3.3.

\[ \sum_{i=1}^{n} \log \frac{i}{\max\left(\mathrm{occ}(s[i], s[1..i]) - \sigma, 1\right)} + n + O(\sigma \log n) \leq (H + 1)n + O(\sigma^2 \log n) . \]

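Proof. Since $\{\mathrm{occ}(s[i], s[1..i]) : 1 \leq i \leq n\}$ and $\{j : 1 \leq j \leq \mathrm{occ}(a, s),\ a \text{ an element}\}$ are the
same multiset,

\begin{align*}
\sum_{i=1}^{n} \log \frac{i}{\max\left(\mathrm{occ}(s[i], s[1..i]) - \sigma, 1\right)}
 &= \log(n!) - \sum_{a} \sum_{j=1}^{\mathrm{occ}(a, s) - \sigma} \log j \\
 &\leq \log(n!) - \sum_{a} \sum_{j=1}^{\mathrm{occ}(a, s)} \log j + O(\sigma^2 \log n) \\
 &= \log(n!) - \sum_{a} \log(\mathrm{occ}(a, s)!) + O(\sigma^2 \log n) \\
 &= \log \frac{n!}{\mathrm{occ}(a_1, s)! \cdots \mathrm{occ}(a_\sigma, s)!} + O(\sigma^2 \log n) \\
 &\leq nH + O(\sigma^2 \log n)
\end{align*}

by Lemma 3.2. It follows that

\[ \sum_{i=1}^{n} \log \frac{i}{\max\left(\mathrm{occ}(s[i], s[1..i]) - \sigma, 1\right)} + n + O(\sigma \log n) \leq (H + 1)n + O(\sigma^2 \log n) . \]

□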



Our upper bound follows immediately from Corollary 3.3.



Theorem 3.4. When $\sigma = o(n^{1/2} / \log n)$, we can sort $s$ online and stably using
$(H + 1)n + o(n)$ ternary comparisons and $O((H + 1)n)$ time.

3.2 Lower bound 

Consider any online, stable sorting algorithm that uses ternary comparisons. Since 
the algorithm is online and stable, it must determine each element's rank relative 
to the distinct elements already seen, before moving on to the next element. 
Since it uses ternary comparisons, we can represent its strategy for each element 
as an extended binary search tree whose keys are the distinct elements already 
seen. If the current element is distinct from all those already seen, then the 
algorithm reaches a leaf of the tree, and the minimum number of comparisons it 
performs is equal to that leaf's depth. If the current element has been seen before, 
however, then the algorithm stops at an internal node and the minimum number 
of comparisons it performs is 1 greater than that node's depth; again, the extra
comparison is necessary to check that the algorithm should not proceed deeper 
into the tree. 

Suppose $\sigma = o(n / \log n)$ is a power of 2; then, in any binary search tree on $\sigma$
keys, some key has depth at least $\log \sigma \geq H$ (the inequality holds because any distribution
on $\sigma$ elements has entropy at most $\log \sigma$). Furthermore, suppose an adversary
starts by presenting one copy of each of $\sigma$ distinct elements; after that, it considers
the algorithm's strategy for the next element as a binary search tree, and presents
the deepest key. This way, the adversary forces the algorithm to use at least
$(n - \sigma)(\log \sigma + 1) \geq (H + 1)n - o(n)$ comparisons.
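
The final inequality in the adversary argument is stated without its intermediate steps. One way to spell them out, using only $H \leq \log \sigma$, $\sigma \leq n$ and $\sigma = o(n / \log n)$, is the following short derivation (our own expansion of the step, not part of the original argument):

```latex
\begin{align*}
(n - \sigma)(\log \sigma + 1)
  &= (\log \sigma + 1)\,n - \sigma(\log \sigma + 1) \\
  &\geq (H + 1)\,n - \sigma(\log \sigma + 1)
      && \text{since } H \leq \log \sigma \\
  &\geq (H + 1)\,n - \sigma(\log n + 1)
      && \text{since } \sigma \leq n \\
  &= (H + 1)\,n - o(n)
      && \text{since } \sigma = o(n / \log n) .
\end{align*}
```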

Theorem 3.5. We generally need $(H + 1)n - o(n)$ ternary comparisons to sort $s$
online and stably. 
Chapter 4



Online Sorting with Sublinear 
Memory 

When in doubt, sort! Librarians, secretaries and computer scientists all know 
that when faced with lots of data, often the best thing is to organize them. For 
some applications, though, the data are so overwhelming that we cannot sort. The 
streaming model was introduced for situations in which the flow of data cannot be 
paused or stored in its entirety; the model's assumptions are that we are allowed 
only one pass over the input and memory sublinear in its size (see, e.g., [Mut05]).
Those assumptions mean we cannot sort in general, but in this chapter we show 
we can when the data are very compressible. 

Our inspiration comes from two older articles on sorting. In the first, "Sorting
and searching in multisets" from 1976, Munro and Spira [MS76] considered the
problem of sorting a multiset $s$ of size $n$ containing $\sigma$ distinct elements in the
comparison model. They showed sorting $s$ takes $\Theta((H + 1)n)$ time, where
$H = \sum_{i=1}^{\sigma} (n_i/n) \log(n/n_i)$ is the entropy of $s$, $\log$ means $\log_2$ and $n_i$ is the
frequency of the $i$th smallest distinct element. When $\sigma$ is small or the distribution
of elements in $s$ is very skewed, this is a significant improvement over the
$\Theta(n \log n)$ bound for sorting a set of size $n$.

In the second article, "Selection and sorting with limited storage" from 1980,
Munro and Paterson [MP80] considered the problem of sorting a set $s$ of size $n$
using limited memory and few passes. They showed sorting $s$ in $p$ passes takes
$\Theta(n/p)$ memory locations in the following model (we have changed their variable
names for consistency with our own):






In our computational model the data is a sequence of n distinct ele- 
ments stored on a one-way read-only tape. An element from the tape 
can be read into one of r locations of random-access storage. The 
elements are from some totally ordered set (for example the real num- 
bers) and a binary comparison can be made at any time between any 
two elements within the random-access storage. Initially the storage is 
empty and the tape is placed with the reading head at the beginning. 
After each pass the tape is rewound to this position with no read- 
ing permitted. . . . [I]n view of the limitations imposed by our model, 
[sorting] must be considered as the determination of the sorted order 
rather than any actual rearrangement. 



An obvious question, but one that apparently has still not been addressed
decades later, is how much memory we need to sort a multiset in few passes;
in this chapter we consider the case when we are allowed only one pass. We
assume our input is the same as Munro and Spira's, a multiset $s = \{s_1, \ldots, s_n\}$
with entropy $H$ containing $\sigma$ distinct elements. To simplify our presentation, we
assume $\sigma \ge 2$ so $Hn = \Omega(\log n)$. Our model is similar to Munro and Paterson's
but it makes no difference to us whether the tape is read-only or read-write, since
we are allowed only one pass, and whereas they counted memory locations, we
count bits. We assume machine words are $\Theta(\log n)$ bits long, an element fits in a
constant number of words and we can perform standard operations on words in
unit time. Since entropy is minimized when the distribution is maximally skewed,

\[
Hn \;\ge\; n \left( \frac{n - \sigma + 1}{n} \log \frac{n}{n - \sigma + 1} + \frac{\sigma - 1}{n} \log n \right) \;\ge\; (\sigma - 1) \log n ;
\]

thus, under our assumptions, $O(\sigma)$ words take $O(\sigma \log n) \subseteq O(Hn)$ bits.



In Section 4.1 we consider the problem of determining the permutation $\pi$ such
that $s_{\pi(1)}, \ldots, s_{\pi(n)}$ is the stable sort of $s$ (i.e., $s_{\pi(i)} \le s_{\pi(i+1)}$ and, if
$s_{\pi(i)} = s_{\pi(i+1)}$, then $\pi(i) < \pi(i + 1)$). For example, if

\[ s = a_1, b_1, r_1, a_2, c, a_3, d, a_4, b_2, r_2, a_5 \]

(with subscripts serving only to distinguish copies of the same distinct element),
then the stable sort of $s$ is

\[ a_1, a_2, a_3, a_4, a_5, b_1, b_2, c, d, r_1, r_2 \;=\; s_1, s_4, s_6, s_8, s_{11}, s_2, s_9, s_5, s_7, s_3, s_{10} \]

and

\[ \pi = 1, 4, 6, 8, 11, 2, 9, 5, 7, 3, 10 . \]

We give a simple algorithm that computes $\pi$ using one pass, $O((H + 1)n)$ time
and $O(Hn)$ bits of memory. In Section 4.2 we consider the simpler problem
of determining a permutation $\rho$ such that $s_{\rho(1)}, \ldots, s_{\rho(n)}$ is in sorted order (not
necessarily stably-sorted). We prove that in the worst case it takes $\Omega(Hn)$ bits of
memory to compute any such $\rho$ in one pass.
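As a quick check of the running example, the permutation $\pi$ can be computed offline with any stable sort. The sketch below is only illustrative: it keeps all of $s$ in memory, unlike the one-pass algorithm of Section 4.1, and the function name `stable_sort_permutation` is our own.

```python
def stable_sort_permutation(s):
    """Return pi (1-based) such that s[pi[0]-1], s[pi[1]-1], ... is the stable sort of s."""
    # Python's sorted() is stable, so equal elements keep their original relative order.
    return sorted(range(1, len(s) + 1), key=lambda j: s[j - 1])


pi = stable_sort_permutation("abracadabra")
print(pi)
```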



4.1 Algorithm 

The key to our algorithm is the fact $\pi = \ell_1 \cdots \ell_\sigma$, where $\ell_i$ is the sorted list
of positions in which the $i$th smallest distinct element occurs. In our example,
$s = a, b, r, a, c, a, d, a, b, r, a$,

\[
\begin{aligned}
\ell_1 &= 1, 4, 6, 8, 11 \\
\ell_2 &= 2, 9 \\
\ell_3 &= 5 \\
\ell_4 &= 7 \\
\ell_5 &= 3, 10 .
\end{aligned}
\]

Since each $\ell_i$ is a strictly increasing sequence, we can store it compactly using
Elias' gamma code [Eli75]: we write the first number in $\ell_i$, encoded in the gamma
code; for $1 \le j < n_i$, we write the difference between the $(j + 1)$st and $j$th
numbers, encoded in the gamma code. The gamma code is a prefix-free code for
the positive integers; for $x \ge 1$, $\gamma(x)$ consists of $\lfloor \log x \rfloor$ zeroes followed by the
$(\lfloor \log x \rfloor + 1)$-bit binary representation of $x$. In our example, we encode $\ell_1$ as

\[ \gamma(1)\,\gamma(3)\,\gamma(2)\,\gamma(2)\,\gamma(3) = 1\ 011\ 010\ 010\ 011 . \]
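The gamma code and the gap encoding of a position list can be sketched in a few lines. This is a minimal illustration using Python strings of '0'/'1' characters rather than packed bits; the names `gamma` and `encode_positions` are ours.

```python
def gamma(x):
    """Elias gamma code of a positive integer x, as a string of '0'/'1'."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    binary = bin(x)[2:]                      # (floor(log x) + 1)-bit representation
    return "0" * (len(binary) - 1) + binary  # floor(log x) zeroes, then the binary


def encode_positions(positions):
    """Encode a strictly increasing list: first number, then successive differences."""
    out, previous = [], 0
    for p in positions:
        out.append(gamma(p - previous))
        previous = p
    return "".join(out)


code_for_a = encode_positions([1, 4, 6, 8, 11])
print(code_for_a)
```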

Lemma 4.1. We can store $\pi$ in $O(Hn)$ bits of memory.

Proof. Encoding the length of every list with the gamma code takes $O(\sigma \log n) \subseteq
O(Hn)$ bits. Notice the numbers in each list $\ell_i$ sum to at most $n$. By Jensen's
Inequality, since $|\gamma(x)| \le 2 \log x + 1$ and $\log$ is concave, we store $\ell_i$ in at most
$2 n_i \log(n/n_i) + n_i$ bits. Therefore, storing $\ell_1, \ldots, \ell_\sigma$ as described above takes

\[ \sum_{i=1}^{\sigma} \left( 2 n_i \log(n/n_i) + n_i \right) = O((H + 1)n) \]

bits of memory.

To reduce $O((H + 1)n)$ to $O(Hn)$ (important when one distinct element
dominates, so $H$ is close to 0 and $Hn \ll n$) we must avoid writing a codeword
for each element in $s$. Notice that, for each run of length at least 2 in $s$ (a run
being a maximal subsequence of copies of the same element) there is a run of 1's
in the corresponding list $\ell_i$. We replace each run of 1's in $\ell_i$ by a single 1 and the
length of the run. For each except the last run of each distinct element, the
run-length is at most the number we write for the element in $s$ immediately following
that run; storing the last run-length for every character takes $O(\sigma \log n) \subseteq O(Hn)$
bits. It follows that storing $\ell_1, \ldots, \ell_\sigma$ takes

\[ \sum_{i=1}^{\sigma} O\!\left( r_i \log(n/r_i) + r_i \right) \;\le\; \sum_{i=1}^{\sigma} O\!\left( n_i \log(n/n_i) \right) + O(r) \;=\; O(Hn + r) \]

bits, where $r_i$ is the number of runs of the $i$th smallest distinct element and $r$ is
the total number of runs in $s$. Mäkinen and Navarro [MN05] showed $r \le Hn + 1$,
so our bound is $O(Hn)$. □

To compute $\ell_1, \ldots, \ell_\sigma$ in one pass, we keep track of which distinct elements
have occurred and the positions of their most recent occurrences, which takes $O(\sigma)$
words of memory. For $1 \le j \le n$, if $s_j$ is an occurrence of the $i$th smallest distinct
element and that element has not occurred before, then we start $\ell_i$'s encoding
with $\gamma(j)$; if it last occurred in position $k \le j - 2$, then we append $\gamma(j - k)$ to
$\ell_i$; if it occurred in position $j - 1$ but not $j - 2$, then we append $\gamma(1)\,\gamma(1)$ to $\ell_i$;
if it occurred in both positions $j - 1$ and $j - 2$, then we increment the encoded
run-length at the end of $\ell_i$'s encoding. Because we do not know in advance how
many bits we will use to encode each $\ell_i$, we keep the encoding in an expandable
binary array [CLRS01]: we start with an array of size 1 bit; whenever the array
overflows, we create a new array twice as big, copy the contents from the old array
into the new one, and destroy the old array. We note that appending a bit to the
encoding takes amortized constant time.
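The one-pass bookkeeping can be sketched as follows. This is a simplified illustration, not the full algorithm: it omits the run-length refinement from the proof of Lemma 4.1 (the running example has no runs, so the output agrees with Figures 4.1 to 4.3), it uses a Python dictionary where the text later uses a splay tree, and it uses strings in place of expandable bit arrays. The names `gamma` and `encode_lists_one_pass` are ours.

```python
def gamma(x):
    """Elias gamma code of a positive integer x, as a string of '0'/'1'."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_lists_one_pass(s):
    """Read s once; return {element: gamma-coded gap list} (no run-length step)."""
    last_seen = {}   # element -> position of its most recent occurrence
    encoding = {}    # element -> bit string built so far
    for j, element in enumerate(s, start=1):
        if element not in last_seen:
            encoding[element] = gamma(j)                       # first occurrence
        else:
            encoding[element] += gamma(j - last_seen[element])  # gap to previous one
        last_seen[element] = j
    return encoding


arrays = encode_lists_one_pass("abracadabra")
for element in sorted(arrays):
    print(element, arrays[element])
```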

Lemma 4.2. We can compute $\ell_1, \ldots, \ell_\sigma$ in one pass using $O(Hn)$ bits of memory.

Proof. Since we are not yet concerned with time, for each element in $s$ we can
simply perform a linear search (which is slow but uses no extra memory)
through the entire list of distinct elements to find the encoding we should extend.
Since an array is never more than twice the size of the encoding it holds, we use
$O(Hn)$ bits of memory for the arrays. □



To make our algorithm time-efficient, we use search in a splay tree [ST85]
instead of linear search. At each node of the splay tree, we store a distinct element
as the key, the position of that element's most recent occurrence and a pointer to
the array for that element. For $1 \le j \le n$, we search for $s_j$ in the splay tree; if we
find it, then we extend the encoding of the corresponding list as described above,
set the position of $s_j$'s most recent occurrence to $j$ and splay $s_j$'s node to the root;
if not, then we insert a new node storing $s_j$ as its key, position $j$ and a pointer
to an expandable array storing $\gamma(j)$, and splay the node to the root. Figure 4.1
shows the state of our splay tree and arrays after we process the first 9 elements
in our example, i.e., a, b, r, a, c, a, d, a, b. Figure 4.2 shows the changes when we
process the next element, an r: we double the size of r's array from 4 to 8 bits
in order to append $\gamma(10 - 3 = 7) = 00111$, set the position of r's most recent
occurrence to 10 and splay r's node to the root. Figure 4.3 shows the final state
of our splay tree and arrays after we process the last element, an a: we append
$\gamma(11 - 8 = 3) = 011$ to a's array (but since only 10 of its 16 bits were already
used, we do not expand it), set the position of a's most recent occurrence to 11
and splay a's node to the root.

Lemma 4.3. We can compute $\ell_1, \ldots, \ell_\sigma$ in one pass using $O((H + 1)n)$ time and
$O(Hn)$ bits of memory.

Proof. Our splay tree takes $O(\sigma)$ words of memory and, so, does not change the
bound on our memory usage. For $1 \le i \le \sigma$ we search for the $i$th smallest distinct
element once when it is not in the splay tree, insert it once, and search for it $n_i - 1$
times when it is in the splay tree. Therefore, by the Update Lemma [ST85] for
splay trees, the total time taken for all the operations on the splay tree is

\[ \sum_{i=1}^{\sigma} O\!\left( \log \frac{W}{\min(w_{i-1}, w_{i+1})} + n_i \log \frac{W}{w_i} + n_i \right) , \]

where $w_1, \ldots, w_\sigma$ are any positive weights, $W$ is their sum and $w_0 = w_{\sigma+1} = \infty$.
Setting $w_i = n_i$ for $1 \le i \le \sigma$, this bound becomes

\[ \sum_{i=1}^{\sigma} O\!\left( (n_i + 2)(\log(n/n_i) + 1) \right) = O((H + 1)n) . \]

Because appending a bit to an array takes amortized constant time, the total time
taken for operations on the arrays is proportional to the total length in bits of the
encodings, i.e., $O(Hn)$. □

Figure 4.1: Our splay tree and arrays after we process a, b, r, a, c, a, d, a, b.

Figure 4.2: Our splay tree and arrays after we process a, b, r, a, c, a, d, a, b, r; notice
we have doubled the size of the array for r, in order to append $\gamma(7) = 00111$.

Figure 4.3: Our splay tree and arrays after we process a, b, r, a, c, a, d, a, b, r, a;
notice we have not had to expand the array for a in order to append $\gamma(3) = 011$.

After we process all of $s$, we can compute $\pi$ from the state of the splay tree
and arrays: we perform an in-order traversal of the splay tree; when we visit a
node, we decode the numbers in its array and output their positive partial sums
(this takes $O(1)$ words of memory and time proportional to the length in bits
of the encoding, because the gamma code is prefix-free); this way, we output the
concatenation of the decoded lists in increasing order by element, i.e., $\ell_1 \cdots \ell_\sigma = \pi$.
In our example, we visit the nodes in the order a, b, c, d, r; when we visit a's node
we output







\[
\begin{aligned}
\gamma^{-1}(1) &= 1 \\
1 + \gamma^{-1}(011) &= 1 + 3 = 4 \\
4 + \gamma^{-1}(010) &= 4 + 2 = 6 \\
6 + \gamma^{-1}(010) &= 6 + 2 = 8 \\
8 + \gamma^{-1}(011) &= 8 + 3 = 11 .
\end{aligned}
\]
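The decoding step can be sketched as below, assuming the arrays are given as strings of '0'/'1' (the contents are those of the final arrays in the running example). A sorted dictionary stands in for the in-order traversal of the splay tree; the names `gamma_decode_all` and `permutation_from_arrays` are ours.

```python
from itertools import accumulate


def gamma_decode_all(bits):
    """Decode a concatenation of Elias gamma codewords into a list of integers."""
    numbers, i = [], 0
    while i < len(bits):
        zeroes = 0
        while bits[i] == "0":          # count the leading zeroes of this codeword
            zeroes += 1
            i += 1
        numbers.append(int(bits[i:i + zeroes + 1], 2))
        i += zeroes + 1
    return numbers


def permutation_from_arrays(arrays):
    """Visit elements in sorted order and output partial sums of their decoded gaps."""
    pi = []
    for element in sorted(arrays):
        pi.extend(accumulate(gamma_decode_all(arrays[element])))
    return pi


final_arrays = {
    "a": "1011010010011",
    "b": "01000111",
    "c": "00101",
    "d": "00111",
    "r": "01100111",
}
pi = permutation_from_arrays(final_arrays)
print(pi)
```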



Our results in this section culminate in the following theorem: 

Theorem 4.4. We can compute $\pi$ in one pass using $O((H + 1)n)$ time and
$O(Hn)$ bits of memory.

We note $s$ can be recovered efficiently from our splay tree and arrays: we start
with an empty priority queue $Q$ and insert a copy of each distinct element, with
priority equal to the position of its first occurrence (i.e., the first encoded number
in its array); for $1 \le j \le n$, we dequeue the element with minimum priority,
output it, and reinsert it with priority equal to the position of its next occurrence
(i.e., its previous priority plus the next encoded number in its array). This idea,
that a sorted ordering of $s$ partially encodes it, is central to our lower bound
in the next section.
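The recovery procedure can be sketched with a binary heap as the priority queue. For brevity the sketch assumes the gap lists have already been decoded from the arrays; the name `recover_sequence` and the input layout are our own choices.

```python
import heapq


def recover_sequence(gap_lists):
    """Rebuild s from {element: [first position, gap, gap, ...]}."""
    # Heap entries are (position of next occurrence, element, index of next gap).
    heap = [(gaps[0], element, 1) for element, gaps in gap_lists.items()]
    heapq.heapify(heap)
    output = []
    while heap:
        position, element, k = heapq.heappop(heap)
        output.append(element)
        gaps = gap_lists[element]
        if k < len(gaps):  # reinsert with the position of the next occurrence
            heapq.heappush(heap, (position + gaps[k], element, k + 1))
    return output


gap_lists = {"a": [1, 3, 2, 2, 3], "b": [2, 7], "c": [5], "d": [7], "r": [3, 7]}
recovered = "".join(recover_sequence(gap_lists))
print(recovered)
```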



4.2 Lower bound 

Consider any algorithm $A$ that, allowed one pass over $s$, outputs a permutation $\rho$
such that $s_{\rho(1)}, \ldots, s_{\rho(n)}$ is in sorted order (not necessarily stably-sorted).
$A$ generally cannot output anything until it has read all of $s$, in case $s_n$ is the
unique minimum; also, given the frequency of each distinct element, $\rho$ tells us the
arrangement of elements in $s$ up to equivalence.

Theorem 4.5. In the worst case, it takes $\Omega(Hn)$ bits of memory to compute any
sorted ordering of $s$ in one pass.

Proof. Suppose each $n_i = n/\sigma$, so $H = \log \sigma$ and the number of possible distinct
arrangements of the elements in $s$ is maximized. It follows that in the worst case $A$ uses at least

\[
\begin{aligned}
\log \frac{n!}{((n/\sigma)!)^{\sigma}}
  &= \log n! - \sigma \log (n/\sigma)! \\
  &\ge n \log n - n \log e - \sigma \left( (n/\sigma) \log(n/\sigma) - (n/\sigma) \log e + O(\log(n/\sigma)) \right) \\
  &= n \log \sigma - O(\sigma \log(n/\sigma))
\end{aligned}
\]

bits of memory to store $\rho$; the inequality holds by Stirling's Formula,

\[ x \log x - x \log e < \log x! \le x \log x - x \log e + O(\log x) . \]

If $\sigma = O(1)$ then

\[ \sigma \log(n/\sigma) = O(\log n) \subset o(n) ; \]

otherwise, since $\sigma \log(n/\sigma)$ is maximized when $\sigma = n/e$,

\[ \sigma \log(n/\sigma) = O(n) \subset o(n \log \sigma) ; \]

in both cases,

\[ n \log \sigma - O(\sigma \log(n/\sigma)) \ge n \log \sigma - o(n \log \sigma) \ge \Omega(Hn) . \] □
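The counting bound in this proof is easy to check numerically. The sketch below evaluates $\log (n! / ((n/\sigma)!)^\sigma)$ with the log-gamma function and compares it with $n \log \sigma$; the sample values $n = 1024$, $\sigma = 4$ and the function names are our own, and the comparison against $\sigma \log(n/\sigma)$ only illustrates the order of the lower-order term for this one instance.

```python
import math


def log2_factorial(x):
    """log2(x!) computed via the log-gamma function."""
    return math.lgamma(x + 1) / math.log(2)


def arrangement_bits(n, sigma):
    """log2 of the number of arrangements when every element occurs n/sigma times."""
    if n % sigma != 0:
        raise ValueError("this sketch assumes sigma divides n")
    return log2_factorial(n) - sigma * log2_factorial(n // sigma)


n, sigma = 1024, 4
bits = arrangement_bits(n, sigma)
upper = n * math.log2(sigma)            # n log sigma, which equals Hn here
slack = sigma * math.log2(n / sigma)    # scale of the O(sigma log(n/sigma)) term
print(round(bits, 1), upper, slack)
```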
Chapter 5
One-Pass Compression



Data compression has come of age in recent years and compression algorithms
are now vital in situations unforeseen by their designers. This has led to a
discrepancy between the theory of data compression algorithms and their use in
practice: compression algorithms are often designed and analysed assuming the
compression and decompression operations can use a "sufficiently large" amount
of working memory; however, in some situations, particularly in mobile or
embedded computing environments, the memory available is very small compared to
the amount of data we need to compress or decompress. Even when compression
algorithms are implemented to run on powerful desktop computers, some care is
taken to be sure that the compression/decompression of large files does not take
over all the RAM of the host machine. This is usually accomplished by splitting
the input in blocks (e.g., bzip), using heuristics to determine when to discard the
old data (e.g., compress, ppmd), or by maintaining a "sliding window" over the
more recently seen data and forgetting the oldest data (e.g., gzip).

In this chapter we initiate the theoretical study of space-conscious compression
algorithms. Although data compression algorithms have their own peculiarities,
this study belongs to the general field of algorithmics in the streaming model (see,
e.g., [BBD+02, Mut05]), in which we are allowed only one pass over the input
and memory sublinear (possibly polylogarithmic or even constant) in its size.
We prove tight upper and lower bounds on the compression ratio achievable by
one-pass algorithms that use an amount of memory independent of the size of the
input. By "one-pass", we mean that the algorithms are allowed to read each input
symbol only once; hence, if an algorithm needs to access (portions of) the input
more than once it must store it, consuming part of its precious working memory.
Our bounds are worst-case and given in terms of the $k$th-order empirical
entropy of the input string. More precisely we prove the following results:

(a) Let $\lambda \ge 1$, $k \ge 0$ and $\epsilon > 0$ be constants and let $g$ be a function independent
of $n$. In the worst case it is impossible to store a string $s$ of length $n$ over
an alphabet of size $\sigma$ in $\lambda H_k(s) n + o(n \log \sigma) + g$ bits using one pass and
$O(\sigma^{k + 1/\lambda - \epsilon})$ bits of memory.

(b) Given a $(\lambda H_k(s) n + o(n \log \sigma) + g)$-bit encoding of $s$, it is impossible to recover
$s$ using one pass and $O(\sigma^{k + 1/\lambda - \epsilon})$ bits of memory.

(c) Given $\lambda \ge 1$, $k \ge 0$ and $\mu > 0$, we can store $s$ in $\lambda H_k(s) n + \mu n +
O(\sigma^{k + 1/\lambda} \log \sigma)$ bits using one pass and $O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits of memory,
and later recover $s$ using one pass and the same amount of memory.

While $\sigma$ is often treated as constant in the literature, we treat it as a variable to
distinguish between, say, $O(\sigma^{k + 1/\lambda - \epsilon})$ and $O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits. Informally, (a)
provides a lower bound to the amount of memory needed to compress a string up
to its $k$th-order entropy; (b) tells us the same amount of memory is required also
for decompression and implies that the use of a powerful machine for doing the
compression does not help if only limited memory is available when decompression
takes place; (c) establishes that (a) and (b) are nearly tight. Notice $\lambda$ plays a
dual role: for large $k$, it makes (a) and (b) inapproximability results (e.g., we
cannot use $O(\sigma^k)$ bits of memory without worsening the compression in terms
of $H_k(s)$ by more than a constant factor); for small $k$, it makes (c) an interesting
approximability result (e.g., we can compress reasonably well in terms of $H_0(s)$
using, say, $O(\sqrt{\sigma})$ bits of memory). The main difference between the bounds in
(a)-(b) and (c) is a $\sigma^{\epsilon} \log^2 \sigma$ factor in the memory usage. Since $\mu$ is a constant,
$\mu n \in o(n \log \sigma)$ and the bounds on the encoding's length match. Note that $\mu$ can
be arbitrarily small, but the term $\mu n$ cannot be avoided (Lemma 5.5).

We use $s$ to denote the string that we want to compress. We assume that
$s$ has length $n$ and is drawn from an alphabet of size $\sigma$. Note that we measure
memory in terms of alphabet size so $\sigma$ is considered a variable. The 0th-order
empirical entropy $H_0(s)$ of $s$ is defined as $H_0(s) = \sum_a \frac{\mathrm{occ}(a, s)}{n} \log \frac{n}{\mathrm{occ}(a, s)}$, where
$\mathrm{occ}(a, s)$ is the number of times character $a$ occurs in $s$; throughout, we write $\log$
to mean $\log_2$ and assume $0 \log 0 = 0$. It is well known that $H_0$ is the maximum
compression we can achieve using a fixed codeword for each alphabet symbol. We
can achieve a greater compression if the codeword we use for each symbol depends
on the $k$ symbols preceding it. In this case the maximum compression is bounded
by the $k$th-order entropy $H_k(s)$ (see [KM99] for the formal definition). We use
two properties of $k$th-order entropy in particular:

• $H_k(s_1)|s_1| + H_k(s_2)|s_2| \le H_k(s_1 s_2)|s_1 s_2|$,

• since $H_0(s) \le \log |\{a : a \text{ occurs in } s\}|$, we have

\[ H_k(s) \le \log \max_{|w| = k} \{\, j : w \text{ is followed by } j \text{ distinct characters in } s \,\} . \]

We point out that the empirical entropy is defined pointwise for any string and 
can be used to measure the performance of compression algorithms as a function 
of the string's structure, thus without any assumption on the input source. For 
this reason we say that the bounds given in terms of Hk are worst-case bounds. 
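Empirical entropies are straightforward to compute directly from a string. The sketch below assumes the usual definition of $H_k$ (the text defers to [KM99] for it): $n H_k(s)$ is the sum, over all contexts $w$ of length $k$, of $|s_w| H_0(s_w)$, where $s_w$ collects the characters that follow occurrences of $w$ in $s$. The function names `h0` and `hk` are ours.

```python
import math
from collections import Counter, defaultdict


def h0(s):
    """0th-order empirical entropy of a sequence, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


def hk(s, k):
    """kth-order empirical entropy: per-context H_0, weighted by context frequency."""
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])   # character following context s[i:i+k]
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


example = "abracadabra"
print(h0(example), hk(example, 1))
```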

Some of our arguments are based on Kolmogorov complexity [LV08]: the
Kolmogorov complexity of $s$, denoted $K(s)$, is the length in bits of the shortest
program that outputs $s$; it is generally incomputable but can be bounded from
below by counting arguments (e.g., in a set of $m$ elements, most have Kolmogorov
complexity at least $\log m - O(1)$). We use two properties of Kolmogorov
complexity in particular: if an object can be easily computed from other objects, then its
Kolmogorov complexity is at most the sum of theirs plus a constant; and a fixed,
finite object has constant Kolmogorov complexity.

5.1 Algorithm 

Move-to-front compression [BSTW86] is probably the best example of a
compression algorithm whose space complexity is independent of the input length: keep
a list of the characters that have occurred in decreasing order by recency; store
each character in the input by outputting its position in the list (or, if it has not
occurred before, its index in the alphabet) encoded in Elias' $\delta$ code, then move
it to the front of the list. Move-to-front stores a string $s$ of length $n$ over an
alphabet of size $\sigma$ in $(H_0(s) + O(\log H_0(s)))\, n + O(\sigma \log \sigma)$ bits using one pass
and $O(\sigma \log \sigma)$ bits of memory. When memory is scarce, we can use $O(\sigma^{1/\lambda} \log \sigma)$
bits of memory by storing only the $\lceil \sigma^{1/\lambda} \rceil$ most recent characters; it is easy to see
that this increases the number of bits stored by a factor of at most $\lambda$. On the
other hand, note that we can store $s$ in $(H_k(s) + O(\log H_k(s)))\, n + O(\sigma^{k+1} \log \sigma)$
bits by keeping a separate list for each possible context of length $k$; this increases
the memory usage by a factor of at most $\sigma^k$. In this section we first use a more
complicated algorithm to get a better upper bound: given constants $\lambda \ge 1$, $k \ge 0$
and $\mu > 0$, we can store $s$ in $(\lambda H_k(s) + \mu) n + O(\sigma^{k + 1/\lambda} \log \sigma)$ bits using one pass
and $O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits of memory.
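A minimal sketch of basic move-to-front with Elias' $\delta$ code follows (without the memory-limited or per-context variants). The text does not say how the decoder tells a list position from an alphabet index; as an assumption of this sketch, a character not yet in the list is sent as (list length + 1 + alphabet index), so any value larger than the list length signals a new character. All function names are ours.

```python
def gamma(x):
    """Elias gamma code of a positive integer."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b


def delta(x):
    """Elias delta code: gamma code of the bit length, then x without its leading 1."""
    b = bin(x)[2:]
    return gamma(len(b)) + b[1:]


def mtf_encode(s, alphabet):
    recent, out = [], []
    for c in s:
        if c in recent:
            i = recent.index(c)
            out.append(delta(i + 1))          # 1-based position in the recency list
            recent.pop(i)
        else:
            # Assumed escape convention: values above len(recent) mean "new character".
            out.append(delta(len(recent) + 1 + alphabet.index(c)))
        recent.insert(0, c)                   # move (or add) c to the front
    return "".join(out)


def mtf_decode(bits, alphabet):
    recent, out, p = [], [], 0
    while p < len(bits):
        zeroes = 0
        while bits[p] == "0":
            zeroes += 1
            p += 1
        length = int(bits[p:p + zeroes + 1], 2)   # bit length of the encoded value
        p += zeroes + 1
        x = int("1" + bits[p:p + length - 1], 2)
        p += length - 1
        c = recent.pop(x - 1) if x <= len(recent) else alphabet[x - len(recent) - 1]
        recent.insert(0, c)
        out.append(c)
    return "".join(out)


encoded = mtf_encode("abracadabra", "abcdr")
decoded = mtf_decode(encoded, "abcdr")
print(len(encoded), decoded)
```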



We start with the following lemma, based on a previous paper [Gag06b]
about compression algorithms' redundancies, that says we can store an approximation
$Q$ of a probability distribution $P$ in few bits, so that the relative entropy between
$P$ and $Q$ is small. The relative entropy $D(P \| Q) = \sum_{i=1}^{\sigma} p_i \log(p_i/q_i)$ between
$P = p_1, \ldots, p_\sigma$ and $Q = q_1, \ldots, q_\sigma$ is the expected redundancy per character of an
ideal code for $Q$ when characters are drawn according to $P$.



Lemma 5.1. Let $s$ be a string of length $n$ over an alphabet of size $\sigma$ and let $P$ be
the normalized distribution of characters in $s$. Given $s$ and constants $\lambda \ge 1$ and
$\mu > 0$, we can store a probability distribution $Q$ with $D(P \| Q) < (\lambda - 1) H(P) + \mu$
in $O(\sigma^{1/\lambda} \log(n + \sigma))$ bits using $O(\sigma^{1/\lambda} \log(n + \sigma))$ bits of memory.



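Proof. Suppose $P = p_1, \ldots, p_\sigma$. We can use an $O(n \log n)$-time algorithm due to
Misra and Gries [MG82] (see also [DLM02, KSP03]) to find the $t \le r \sigma^{1/\lambda}$ values
of $i$ such that $p_i \ge 1/(r \sigma^{1/\lambda})$, where $r = 1 + \frac{1}{2^{\mu/2} - 1}$, using $O(\sigma^{1/\lambda} \log \max(n, \sigma))$
bits of memory; or, since we are not concerned with time in this chapter, we can
simply make $\sigma$ passes over $s$ to find these $t$ values. For each, we store $i$ and
$\lfloor p_i r^2 \sigma \rfloor$; since $r$ depends only on $\mu$, in total this takes $O(\sigma^{1/\lambda} \log \sigma)$ bits. This
information lets us later recover $Q = q_1, \ldots, q_\sigma$ where

\[
q_i =
\begin{cases}
\dfrac{(1 - 1/r) \lfloor p_i r^2 \sigma \rfloor}{\sum \left\{ \lfloor p_j r^2 \sigma \rfloor : p_j \ge 1/(r \sigma^{1/\lambda}) \right\}} & \text{if } p_i \ge 1/(r \sigma^{1/\lambda}), \\[2ex]
\dfrac{1}{r (\sigma - t)} & \text{otherwise.}
\end{cases}
\]

Suppose $p_i \ge 1/(r \sigma^{1/\lambda})$; then $p_i r^2 \sigma \ge r$. Since
$\sum \{ \lfloor p_j r^2 \sigma \rfloor : p_j \ge 1/(r \sigma^{1/\lambda}) \} \le r^2 \sigma$,

\[
\begin{aligned}
p_i \log(p_i / q_i)
  &\le p_i \log \left( \frac{r}{r - 1} \cdot \frac{p_i r^2 \sigma}{\lfloor p_i r^2 \sigma \rfloor} \right) \\
  &< 2 p_i \log \frac{r}{r - 1} \\
  &= p_i \mu .
\end{aligned}
\]

Now suppose $p_i < 1/(r \sigma^{1/\lambda})$; then $p_i \log(1/p_i) > (p_i/\lambda) \log \sigma$. Therefore

\[
\begin{aligned}
p_i \log(p_i / q_i)
  &< p_i \log\left( (\sigma - t) / \sigma^{1/\lambda} \right) \\
  &\le (\lambda - 1)(p_i/\lambda) \log \sigma \\
  &< (\lambda - 1)\, p_i \log(1/p_i) .
\end{aligned}
\]

Since $p_i \log(p_i/q_i) < (\lambda - 1)\, p_i \log(1/p_i) + p_i \mu$ in both cases,
$D(P \| Q) < (\lambda - 1) H(P) + \mu$. □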
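The construction of $Q$ can be tried out numerically. The sketch below follows our reading of the proof: $r = 1 + 1/(2^{\mu/2} - 1)$, frequent characters are those with $p_i \ge 1/(r \sigma^{1/\lambda})$, and each stores $\lfloor p_i r^2 \sigma \rfloor$. It counts characters exactly instead of using the Misra and Gries algorithm, so it illustrates only the quality of $Q$, not the memory bound. The function names and the sample parameters ($\lambda = 2$, $\mu = 1$) are our own.

```python
import math
from collections import Counter


def approximate_distribution(s, alphabet, lam, mu):
    """Return (P, Q): the empirical distribution of s and its approximation Q."""
    n, sigma = len(s), len(alphabet)
    r = 1 + 1 / (2 ** (mu / 2) - 1)
    threshold = 1 / (r * sigma ** (1 / lam))
    counts = Counter(s)
    p = {a: counts[a] / n for a in alphabet}
    # Frequent characters keep a rounded-down, scaled copy of their probability.
    stored = {a: math.floor(p[a] * r * r * sigma) for a in alphabet if p[a] >= threshold}
    total, t = sum(stored.values()), len(stored)
    q = {}
    for a in alphabet:
        if a in stored:
            q[a] = (1 - 1 / r) * stored[a] / total
        else:
            q[a] = 1 / (r * (sigma - t))   # rare characters share mass 1/r equally
    return p, q


def entropy(p):
    return sum(x * math.log2(1 / x) for x in p.values() if x > 0)


def relative_entropy(p, q):
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)


lam, mu = 2, 1
p, q = approximate_distribution("abracadabra", "abcdr", lam, mu)
print(relative_entropy(p, q), (lam - 1) * entropy(p) + mu)
```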

Armed with this lemma, we can adapt arithmetic coding to use $O(\sigma^{1/\lambda} \log(n + \sigma))$
bits of memory with a specified redundancy per character.

Lemma 5.2. Given a string $s$ of length $n$ over an alphabet of size $\sigma$ and constants
$\lambda \ge 1$ and $\mu > 0$, we can store $s$ in $(\lambda H_0(s) + \mu) n + O(\sigma^{1/\lambda} \log(n + \sigma))$ bits using
$O(\sigma^{1/\lambda} \log(n + \sigma))$ bits of memory.

Proof. Let $P$ be the normalized distribution of characters in $s$, so $H(P) = H_0(s)$.
First, as described in Lemma 5.1, we store a probability distribution $Q$ with
$D(P \| Q) < (\lambda - 1) H(P) + \mu/2$ in $O(\sigma^{1/\lambda} \log \sigma)$ bits using $O(\sigma^{1/\lambda} \log(n + \sigma))$ bits
of memory. Then, we process $s$ in blocks $s_1, \ldots, s_b$ of length $\lceil 4/\mu \rceil$ (except $s_b$ may
be shorter). For $1 \le i < b$, we store $s_i$ as the first $\lceil \log(2 / \Pr[X = s_i]) \rceil$ bits to the
right of the binary point in the binary representation of

\[
\begin{aligned}
f(s_i) &= \Pr[X < s_i] + \Pr[X = s_i]/2 \\
       &= \sum_{j=1}^{\lceil 4/\mu \rceil} \Pr\bigl[ X[1] = s_i[1], \ldots, X[j-1] = s_i[j-1],\ X[j] < s_i[j] \bigr] + \Pr[X = s_i]/2 ,
\end{aligned}
\]

where $X$ is a string of length $\lceil 4/\mu \rceil$ chosen randomly according to $Q$, $X < s_i$
means $X$ is lexicographically less than $s_i$, and $X[j]$ and $s_i[j]$ indicate the indices
in the alphabet of the $j$th characters of $X$ and $s_i$, respectively. Notice that,
since $|f(s_i) - f(y)| \ge \Pr[X = s_i]/2$ for any string $y \ne s_i$ of length $\lceil 4/\mu \rceil$, these
bits uniquely identify $f(s_i)$ and, thus, $s_i$. Also, since the probabilities in $Q$ are
$O(\log \sigma)$-bit numbers, we can compute $f(s_i)$ from $s_i$ with $O(\sigma)$ additions and
$O(1/\mu) = O(1)$ multiplications using $O(\log \sigma)$ bits of memory. (In fact, with
appropriate data structures, $O(\log \sigma)$ additions and $O(1)$ multiplications suffice.)
Finally, we store $s_b$ in $|s_b| \lceil \log \sigma \rceil = O(\log \sigma)$ bits. In total we store $s$ in

\[
\begin{aligned}
& \sum_{i=1}^{b-1} \lceil \log(2 / \Pr[X = s_i]) \rceil + O(\sigma^{1/\lambda} \log \sigma) \\
&\quad \le \sum_{i=1}^{b-1} \left( \sum_{j=1}^{\lceil 4/\mu \rceil} \log(1 / q_{s_i[j]}) + 2 \right) + O(\sigma^{1/\lambda} \log \sigma) \\
&\quad \le n \sum_{i=1}^{\sigma} p_i \log(1/q_i) + 2(b - 1) + O(\sigma^{1/\lambda} \log \sigma) \\
&\quad \le n \left( D(P \| Q) + H(P) \right) + \mu n / 2 + O(\sigma^{1/\lambda} \log \sigma) \\
&\quad \le (\lambda H_0(s) + \mu) n + O(\sigma^{1/\lambda} \log \sigma)
\end{aligned}
\]

bits using $O(\sigma^{1/\lambda} \log(n + \sigma))$ bits of memory. □
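The per-block codeword in this proof (the first $\lceil \log(2/\Pr[X = s_i]) \rceil$ bits of $f(s_i)$) can be computed exactly with rational arithmetic. The sketch below does so for a toy distribution $Q$ of our own choosing and a block length of 2; the function name `block_code` is ours, and the exact fractions are used only to keep the example free of rounding issues.

```python
from fractions import Fraction
from itertools import product


def block_code(block, q, alphabet):
    """First ceil(log2(2 / Pr[X = block])) bits of f(block) after the binary point."""
    prob = Fraction(1)    # Pr[X agrees with the block on the characters read so far]
    below = Fraction(0)   # Pr[X is lexicographically less than the block]
    for c in block:
        idx = alphabet.index(c)
        below += prob * sum(q[a] for a in alphabet[:idx])
        prob *= q[c]
    f = below + prob / 2
    length = 0
    while Fraction(2 ** length) < 2 / prob:   # smallest length with 2^length >= 2/prob
        length += 1
    bits = (f * 2 ** length).numerator // (f * 2 ** length).denominator
    return format(bits, "0{}b".format(length))


q = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
codes = {"".join(b): block_code("".join(b), q, "abc") for b in product("abc", repeat=2)}
for block, code in codes.items():
    print(block, code)
```

In this toy instance the nine codewords are distinct and no codeword is a prefix of another, so concatenated block codewords can be parsed unambiguously.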

We boost our space-conscious arithmetic coding algorithm to achieve a bound
in terms of $H_k(s)$ instead of $H_0(s)$ by running a separate copy for each possible
$k$-tuple, just as we boosted move-to-front compression.

Lemma 5.3. Given a string $s$ of length $n$ over an alphabet of size $\sigma$ and constants
$\lambda \ge 1$, $k \ge 0$ and $\mu > 0$, we can store $s$ in $(\lambda H_k(s) + \mu) n + O(\sigma^{k + 1/\lambda} \log(n + \sigma))$
bits using $O(\sigma^{k + 1/\lambda} \log(n + \sigma))$ bits of memory.



Proof. We store the first $k$ characters of $s$ in $O(\log \sigma)$ bits then apply Lemma 5.2
to subsequences $s_1, \ldots, s_{\sigma^k}$, where $s_i$ consists of the characters in $s$ that
immediately follow occurrences of the lexicographically $i$th possible $k$-tuple. Notice that
although we cannot keep $s_1, \ldots, s_{\sigma^k}$ in memory, enumerating them as many times
as necessary in order to apply Lemma 5.2 takes $O(\log \sigma)$ bits of memory. □



To make our algorithm use one pass and to change the $\log(n + \sigma)$ factor to
$\log \sigma$, we process the input in blocks $s_1, \ldots, s_b$ of length $O(\sigma^{k + 1/\lambda} \log \sigma)$. Notice
each individual block $s_i$ fits in memory (so we can apply Lemma 5.3 to it)
and $\log(|s_i| + \sigma) = O(\log \sigma)$.

Theorem 5.4. Given a string $s$ of length $n$ over an alphabet of size $\sigma$ and
constants $\lambda \ge 1$, $k \ge 0$ and $\mu > 0$, we can store $s$ in $(\lambda H_k(s) + \mu) n + O(\sigma^{k + 1/\lambda} \log \sigma)$
bits using one pass and $O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits of memory, and later recover $s$ using
one pass and the same amount of memory.



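Proof. Let $c$ be a constant such that, by Lemma 5.3, we can store any substring
$s_i$ of $s$ in $(\lambda H_k(s_i) + \mu/2)|s_i| + c \sigma^{k + 1/\lambda} \log \sigma$ bits using $O(\sigma^{k + 1/\lambda} \log(|s_i| + \sigma))$ bits of
memory. We process $s$ in blocks $s_1, \ldots, s_b$ of length $\lceil (2c/\mu) \sigma^{k + 1/\lambda} \log \sigma \rceil$ (except
$s_b$ may be shorter). Notice each block $s_i$ fits in $O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits of memory.
When we reach $s_i$, we read it into memory, apply Lemma 5.3 to it, using

\[ O\!\left( \sigma^{k + 1/\lambda} \log\left( \lceil (2c/\mu) \sigma^{k + 1/\lambda} \log \sigma \rceil + \sigma \right) \right) = O(\sigma^{k + 1/\lambda} \log \sigma) \]

bits of memory, then erase it from memory. In total we store $s$ in

\[
\begin{aligned}
& \sum_{i=1}^{b} \left( (\lambda H_k(s_i) + \mu/2)|s_i| + c \sigma^{k + 1/\lambda} \log \sigma \right) \\
&\quad \le (\lambda H_k(s) + \mu/2) n + b c \sigma^{k + 1/\lambda} \log \sigma \\
&\quad \le (\lambda H_k(s) + \mu) n + c \sigma^{k + 1/\lambda} \log \sigma
\end{aligned}
\]

bits using $O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits of memory.

Notice the encoding of each block $s_i$ also fits in $O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits of memory.
To decode each block later, we read its encoding into memory, search through all
possible strings of length $\lceil (2c/\mu) \sigma^{k + 1/\lambda} \log \sigma \rceil$ in lexicographic order until we find
the one that yields that encoding, using $O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits of memory,
and output it. □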



The method for decompression in the proof of Theorem 5.4 above takes
exponential time but is very simple (recall we are not concerned with time here);
reversing each step of the compression takes linear time but is slightly more
complicated.

5.2 Lower bounds



Theorem 5.4 is still weaker than the strongest compression bounds that ignore
memory constraints, in two important ways: first, even when $\lambda = 1$ the bound on
the compression ratio does not approach $H_k(s)$ as $n$ goes to infinity; second, we
need to know $k$. It is not hard to prove these weaknesses are unavoidable when
using fixed memory. In this section, we use the idea from these proofs to prove a
nearly matching lower bound for compression: in the worst case it is impossible to
store a string $s$ of length $n$ over an alphabet of size $\sigma$ in $\lambda H_k(s) n + o(n \log \sigma) + g$
bits, for any function $g$ independent of $n$, using one encoding pass and $O(\sigma^{k + 1/\lambda - \epsilon})$
bits of memory. We close with a symmetric lower bound for decompression.

Lemma 5.5. Let $\lambda \ge 1$ be a constant and let $g$ be a function independent of $n$. In
the worst case it is impossible to store a string $s$ of length $n$ in $\lambda H_0(s) n + o(n) + g$
bits using one encoding pass and memory independent of $n$.

Proof. Let $A$ be an algorithm that, given $\lambda$, stores $s$ using one pass and memory
independent of $n$. Since $A$'s future output depends only on its state and its
future input, we can model $A$ with a finite-state machine $M$. While reading
$|M|$ characters of $s$, $M$ must visit some state at least twice; therefore either $M$
outputs at least one bit for every $|M|$ characters in $s$ (or $n/|M|$ bits in total)
or for infinitely many strings $M$ outputs nothing. If $s$ is unary, however, then
$H_0(s) = 0$. □
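The dichotomy in this proof can be seen on toy machines. The sketch below simulates a deterministic finite-state encoder on a unary input and counts the bits it emits; both machines are hypothetical examples of our own, described only by a next-state table and a bits-emitted table for the single input character.

```python
def output_bits_on_unary(next_state, bits_emitted, start, n):
    """Total bits a finite-state encoder emits while reading n copies of one character."""
    state, total = start, 0
    for _ in range(n):
        total += bits_emitted[state]
        state = next_state[state]
    return total


# Hypothetical 3-state machine that emits one bit each time it passes through state 0.
cycle_next = {0: 1, 1: 2, 2: 0}
cycle_bits = {0: 1, 1: 0, 2: 0}

# Hypothetical 1-state machine that never emits anything on this input.
silent_next = {0: 0}
silent_bits = {0: 0}

n = 300
chatty = output_bits_on_unary(cycle_next, cycle_bits, 0, n)
silent = output_bits_on_unary(silent_next, silent_bits, 0, n)
print(chatty, silent)
```

The first machine emits a number of bits linear in $n$ even though $H_0(s) = 0$; the second emits the same (empty) output for every length, so its output alone cannot identify $s$.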

Lemma 5.6. Let $\lambda$ be a constant, let $g$ be a function independent of $n$ and let $b$
be a function independent of $n$ and $k$. In the worst case it is impossible to store a
string $s$ of length $n$ over an alphabet of size $\sigma$ in $\lambda H_k(s) n + o(n \log \sigma) + g$ bits for
all $k \ge 0$ using one pass and $b$ bits of memory.

Proof. Let $A$ be an algorithm that, given $\lambda$, $g$, $b$ and $\sigma$, stores $s$ using $b$ bits of
memory. Again, we can model it with a finite-state machine $M$, with $|M| = 2^b$
and $M$'s Kolmogorov complexity $K(M) = K(\langle A, \lambda, g, b, \sigma \rangle) + O(1) = O(\log \sigma)$.
(Since $A$, $\lambda$, $g$ and $b$ are all fixed, their Kolmogorov complexities are $O(1)$.)

Suppose $s$ is a periodic string with period $2b$ whose repeated substring $r$ has
$K(r) = |r| \log \sigma - O(1)$. We can specify $r$ by specifying $M$, the states $M$ is in when
it reaches and leaves any copy of $r$ in $s$, and $M$'s output on that copy of $r$. (If
there were another string $r'$ that took $M$ between those states with that output,
then we could substitute $r'$ for $r$ in $s$ without changing $M$'s output.) Therefore
$M$ outputs at least

\[ K(r) - K(M) - O(\log |M|) = |r| \log \sigma - O(\log \sigma + b) = \Omega(|r| \log \sigma) \]

bits for each copy of $r$ in $s$, or $\Omega(n \log \sigma)$ bits in total. For $k \ge 2b$, however, $H_k(s)$
approaches 0 as $n$ goes to infinity. □

The idea behind these proofs is simple: model a one-pass algorithm with a
finite-state machine and evaluate its behaviour on a periodic string. Nevertheless,
combining it with the following simple results (based on the same previous
paper [Gag06b] as Lemma 5.1) we can easily show a lower bound that nearly
matches Theorem 5.4. (In fact, our proofs are valid even for algorithms that
make preliminary passes that produce no output, perhaps to gather statistics,
like Huffman coding [Huf52], followed by a single encoding pass that produces
all of the output; once the algorithm begins the encoding pass, we can model it
with a finite-state machine.)

Lemma 5.7. Let $\lambda \ge 1$, $k \ge 0$ and $\epsilon > 0$ be constants and let $r$ be a randomly
chosen string of length $\lfloor \sigma^{k + 1/\lambda - \epsilon} \rfloor$ over an alphabet of size $\sigma$. With high probability
every possible $k$-tuple is followed by $O(\sigma^{1/\lambda - \epsilon})$ distinct characters in $r$.

Proof. Consider a $k$-tuple $w$ and let $n = |r|$. For $1 \le i \le n - k$, let $X_i = 1$ if the $i$th through
$(i + k - 1)$st characters of $r$ are an occurrence of $w$ and the $(i + k)$th character
in $r$ does not occur in $w$; otherwise $X_i = 0$. Notice $w$ is followed by at most
$\sum_{i=1}^{n-k} X_i + k$ distinct characters in $r$ and $\Pr[X_i = 1 \mid X_j = 1] \le 1/\sigma^k$ and
$\Pr[X_i = 1 \mid X_j = 0] \le 1/(\sigma^k - 1)$ for $i \ne j$. Therefore, by Chernoff bounds (see [HR90])
and the union bound, with probability greater than

\[ 1 - \frac{\sigma^k}{2^{\,6 \lfloor \sigma^{k + 1/\lambda - \epsilon} \rfloor / (\sigma^k - 1)}} \;\ge\; 1 - \sigma^k / 2^{\,6 \sigma^{1/\lambda - \epsilon}} \]

every $k$-tuple is followed by fewer than $6 \lfloor \sigma^{k + 1/\lambda - \epsilon} \rfloor / (\sigma^k - 1) + k < 12 \sigma^{1/\lambda - \epsilon} + k$
distinct characters. □

Corollary 5.8. Let A > 1, A; > and e > be constants. There exists a string 
r of length [a'^'^^^'^^^l over an alphabet of size a with K{r) = \r \ logo" — 0{l) but 
Hk{r') < (1/A - e) log a + 0{1) fori>l. 
Proof. If r is randomly chosen, then $K(r) \ge |r| \log \sigma - 1$ with probability greater
than 1/2 and, by Lemma 5.7, with high probability every possible k-tuple is
followed by $O(\sigma^{1/\lambda - \epsilon})$ distinct characters in r; therefore there exists an r with
both properties. Every possible k-tuple is followed by at most k more distinct
characters in $r^i$ than in r and, thus,

$H_k(r^i) \le \log \max_{|w| = k} \{ j : w \text{ is followed by } j \text{ distinct characters in } r^i \}$

$\le \log O(\sigma^{1/\lambda - \epsilon})$

$\le (1/\lambda - \epsilon) \log \sigma + O(1)$ . □

Consider what we get if, for some $\epsilon > 0$, we allow the algorithm A from
Lemma 5.6 to use $O(\sigma^{k + 1/\lambda - \epsilon})$ bits of memory, and evaluate it on the periodic
string $r^i$ from Corollary 5.8. Since $r^i$ has period $\lfloor \sigma^{k + 1/\lambda - \epsilon} \rfloor$ and its repeated
substring r has $K(r) = |r| \log \sigma - O(1)$, the finite-state machine M outputs at
least

$K(r) - K(M) - O(\log |M|) = |r| \log \sigma - O(\sigma^{k + 1/\lambda - \epsilon}) = |r| \log \sigma - O(|r|)$

bits for each copy of r in $r^i$, or $n \log \sigma - O(n)$ bits in total. Because
$\lambda H_k(r^i) \le (1 - \epsilon) \log \sigma + O(1)$, this yields the following nearly tight lower bound;
notice it matches Theorem 5.4 except for a $\log^2 \sigma$ factor in the memory usage.
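The inequality on $\lambda H_k(r^i)$ used in the preceding paragraph follows from
Corollary 5.8 together with $\lambda \ge 1$. Spelled out as a short derivation (our
expansion of the step; the $O(1)$ term absorbs the constant factor $\lambda$):

```latex
\begin{align*}
\lambda H_k(r^i)
  &\le \lambda \left( (1/\lambda - \epsilon) \log \sigma + O(1) \right) \\
  &=   (1 - \lambda \epsilon) \log \sigma + O(1) \\
  &\le (1 - \epsilon) \log \sigma + O(1) .
\end{align*}
```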



Theorem 5.9. Let $\lambda \ge 1$, $k \ge 0$ and $\epsilon > 0$ be constants and let g be a function
independent of n. In the worst case it is impossible to store a string s of length
n over an alphabet of size $\sigma$ in $\lambda H_k(s) n + o(n \log \sigma) + g$ bits using one encoding
pass and $O(\sigma^{k + 1/\lambda - \epsilon})$ bits of memory.

Proof. Let A be an algorithm that, given $\lambda$, k, $\epsilon$ and $\sigma$, stores s while using one
encoding pass and $O(\sigma^{k + 1/\lambda - \epsilon})$ bits of memory; we prove that in the worst case
A stores s in more than $\lambda H_k(s) n + o(n \log \sigma) + g$ bits. Again, we can model
it with a finite-state machine M, with $|M| = 2^{O(\sigma^{k + 1/\lambda - \epsilon})}$ and $K(M) = O(\log \sigma)$.
Let r be a string of length $\lfloor \sigma^{k + 1/\lambda - \epsilon} \rfloor$ with $K(r) \ge |r| \log \sigma - O(1)$ and
$H_k(r^i) \le (1/\lambda - \epsilon) \log \sigma + O(1)$ for $i \ge 1$, as described in Corollary 5.8, and suppose $s = r^i$
for some i. We can specify r by specifying M, the states M is in when it reaches
and leaves any copy of r in s, and M's output on that copy. Therefore M outputs
at least

$K(r) - K(M) - O(\log |M|) = |r| \log \sigma - O(|r|)$

bits for each copy of r in s, or $n \log \sigma - O(n)$ bits in total, which is asymptotically
greater than $\lambda H_k(s) n + o(n \log \sigma) + g \le (1 - \epsilon) n \log \sigma + o(n \log \sigma) + g$. □

With a good bound on how much memory is needed for compression, we turn
our attention to decompression. Good bounds here are equally important, because
often data is compressed once by a powerful machine (e.g., a server or
base-station) and then transmitted to many weaker machines (clients or agents)
that decompress it individually. Fortunately for us, compression and decompression
are essentially symmetric. Recall Theorem 5.4 says we can recover s from a
$((\lambda H_k(s) + \mu) n + O(\sigma^{k + 1/\lambda} \log \sigma))$-bit encoding using one pass and
$O(\sigma^{k + 1/\lambda} \log^2 \sigma)$ bits of memory. Using the same idea about finite-state machines
and periodic strings gives us the following nearly matching lower bound:

Theorem 5.10. Let $\lambda \ge 1$, $k \ge 0$ and $\epsilon > 0$ be constants and let g be a function
independent of n. There exists a string s of length n over an alphabet of size $\sigma$
such that, given a $(\lambda H_k(s) n + o(n \log \sigma) + g)$-bit encoding of s, it is impossible to
recover s using one pass and $O(\sigma^{k + 1/\lambda - \epsilon})$ bits of memory.

Proof. Let r be a string of length $\lfloor \sigma^{k + 1/\lambda - \epsilon} \rfloor$ with $K(r) = |r| \log \sigma - O(1)$ but
$H_k(r^i) \le (1/\lambda - \epsilon) \log \sigma + O(1)$ for $i \ge 1$, as described in Corollary 5.8, and
suppose $s = r^i$ for some i. Let A be an algorithm that, given $\lambda$, k, $\epsilon$, $\sigma$ and a
$(\lambda H_k(s) n + o(n \log \sigma) + g)$-bit encoding of s, recovers s using one pass; we prove
A uses $\omega(\sigma^{k + 1/\lambda - \epsilon})$ bits of memory. Again, we can model A with a finite-state
machine M, with $\log |M|$ equal to the number of bits of memory A uses and
$K(M) = O(\log \sigma)$. We can specify r by specifying M, the state M is in when it
starts outputting any copy of r in s, and the bits of the encoding it reads while
outputting that copy of r; therefore

$K(r) \le K(M) + O(\log |M|) + (\lambda H_k(s) n + o(n \log \sigma) + g) \, |r| / n$

$\le O(\log \sigma) + O(\log |M|) + |r| \left( (1 - \epsilon) \log \sigma + o(\log \sigma) + g/n \right)$

$\le (1 - \epsilon) |r| \log \sigma + o(|r| \log \sigma) + O(\log |M|) + |r| g / n ,$

so

$O(\log |M|) + |r| g / n \ge \epsilon |r| \log \sigma - o(|r| \log \sigma) = \Omega(\sigma^{k + 1/\lambda - \epsilon} \log \sigma)$ .

The theorem follows because n can be arbitrarily large compared to g. □
Chapter 6

Stream Compression 



Massive datasets seem to expand to fill the space available and, in situations when
they no longer fit in memory and must be stored on disk, we may need new models
and algorithms. Grohe and Schweikardt [GS05] introduced read/write streams to
model situations in which we want to process data using mainly sequential accesses
to one or more disks. As the name suggests, this model is like the streaming model
(see, e.g., [Mut05]) but, as is reasonable with datasets stored on disk, it allows us
to make multiple passes over the data, change them and even use multiple streams
(i.e., disks). As Grohe and Schweikardt pointed out, sequential disk accesses are
much faster than random accesses (potentially bypassing the von Neumann
bottleneck) and using several disks in parallel can greatly reduce the amount of
memory and the number of accesses needed. For example, when sorting, we need
the product of the memory and accesses to be at least linear when we use one
disk [MP80, GKS07] but only polylogarithmic when we use two [CY91, GS05].
Similar bounds have been proven for a number of other problems, such as checking
set disjointness or equality; we refer readers to Schweikardt's survey [Sch07]
of upper and lower bounds with one or more read/write streams, Hernich and
Schweikardt's recent paper [HS08] relating read/write streams to classical
complexity theory, and Beame and Huynh-Ngoc's recent paper [BH08] on the value
of multiple read/write streams for approximating frequency moments.

Since sorting is an important operation in some of the most powerful data
compression algorithms, and compression is an important operation for reducing
massive datasets to a more manageable size, we wondered whether extra streams
could also help us achieve better compression. In this chapter we consider the
problem of compressing a string s of n characters over an alphabet of size $\sigma$ when
we are restricted to using $\log^{O(1)} n$ bits of memory and $\log^{O(1)} n$ passes over the
data. In Section 6.1, we show how we can achieve universal compression using
only one pass over one stream. Our approach is to break the string into blocks and
compress each block separately, similar to what is done in practice to compress
large files. Although this may not usually significantly worsen the compression
itself, it may stop us from then building a fast compressed index [FMMN07]
(unless we somehow combine the indexes for the blocks) or clustering by
compression [CV05, FGG+07] (since concatenating files should not help us compress
them better if we then break them into pieces again). In Section 6.2 we use a
vaguely automata-theoretic argument to show one stream is not sufficient for us
to achieve good grammar-based compression. Of course, by "good" we mean here
something stronger than universal compression: we want the size of our encoding
to be at most polynomial in the size of the smallest context-free grammar that
generates s and only s. We still do not know whether any constant number of
streams is sufficient for us to achieve such compression. Finally, in Section 6.3, we
show that two streams are necessary and sufficient for us to achieve entropy-only
bounds. Along the way, we show we need two streams to find strings' minimum
periods or compute the Burrows-Wheeler Transform. As far as we know, this is
the first study of compression with read/write streams, and among the first
studies of compression in any streaming model; we hope the techniques we use will
prove to be of independent interest.



6.1 Universal compression 

An algorithm is called universal with respect to a class of sources if, when a string
is drawn from any of those sources, the algorithm's redundancy per character
approaches 0 with probability 1 as the length of the string grows. The class most
often considered, and which we consider in this section, is that of stationary,
ergodic Markov sources (see, e.g., [CT06]). Since the kth-order empirical entropy
$H_k(s)$ of s is the minimum self-information per character of s with respect to a
kth-order Markov source (see [Sav97]), an algorithm is universal if it stores any
string s in $n H_k(s) + o(n)$ bits for any fixed $\sigma$ and k. The kth-order empirical
entropy of s is also our expected uncertainty about a randomly chosen character
of s when given the k preceding characters. Specifically,

$$H_k(s) = \begin{cases} \frac{1}{n} \sum_a \mathrm{occ}(a, s) \log \frac{n}{\mathrm{occ}(a, s)} & \text{if } k = 0, \\ \frac{1}{n} \sum_{|w| = k} |w_s| \, H_0(w_s) & \text{otherwise,} \end{cases}$$

where occ(a, s) is the number of times character a occurs in s, and $w_s$ is the
concatenation of those characters immediately following occurrences of k-tuple w
in s.
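The definition of kth-order empirical entropy translates directly into code. The
following Python sketch is ours (the names `h0` and `hk` are assumptions chosen
for the example) and follows the two cases of the definition: the 0th-order case
from character counts, and the general case by collecting, for each k-tuple w,
the string $w_s$ of characters that follow it.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """0th-order empirical entropy of s, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(c * log2(n / c) for c in Counter(s).values()) / n


def hk(s, k):
    """kth-order empirical entropy of s, in bits per character."""
    if k == 0:
        return h0(s)
    followers = defaultdict(list)  # maps each k-tuple w to the string w_s
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])
    return sum(len(ws) * h0(ws) for ws in followers.values()) / len(s)
```

For example, `hk("abababab", 1)` is 0 because each character is always followed
by the same character, whereas `h0("abababab")` is 1 bit per character.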

In a previous paper [GM07b] we showed how to modify the well-known LZ77
compression algorithm [ZL77] to use sublinear memory while still storing s in
$n H_k(s) + O(n \log\log n / \log n)$ bits for any fixed $\sigma$ and k. Our algorithm uses nearly
linear memory and so does not fit into the model we consider in this chapter,
but we mention it here because it fits into some other streaming models (see,
e.g., [Mut05]) and, as far as we know, was the first compression algorithm to do
so. In the same paper we proved several lower bounds using ideas that eventually
led to our lower bounds in Sections 6.2 and 6.3 of this chapter.



Theorem 6.1 (Gagie and Manzini, 2007). We can achieve universal compression
using one pass over one stream and $O(n / \log^2 n)$ bits of memory.

To achieve universal compression with only polylogarithmic memory, we use
a recent algorithm due to Gupta, Grossi and Vitter [GGV08]. Although they
designed it for the RAM model, we can easily turn it into a streaming algorithm
by processing s in small blocks and compressing each block separately.

Theorem 6.2 (Gupta, Grossi and Vitter, 2008). In the RAM model, we can store
any string s in $n H_k(s) + O(\sigma^k \log n)$ bits, for all k simultaneously, using $O(n)$
time.

Corollary 6.3. We can achieve universal compression using one pass over one
stream and $O(\log^{1 + \epsilon} n)$ bits of memory.

Proof. We process s in blocks of $\lceil \log^{\epsilon} n \rceil$ characters, as follows: we read each block
into memory, apply Theorem 6.2 to it, output the result, empty the memory, and
move on to the next block. (If n is not given in advance, we increase the block
size as we read more characters.) Since Gupta, Grossi and Vitter's algorithm
uses $O(n)$ time in the RAM model, it uses $O(n \log n)$ bits of memory and we use
$O(\log^{1 + \epsilon} n)$ bits of memory. If the blocks are $s_1, \ldots, s_b$, then we store all of them
in a total of

$$\sum_{i=1}^{b} \left( |s_i| H_k(s_i) + O(\sigma^k \log\log n) \right) \le n H_k(s) + O(\sigma^k n \log\log n / \log^{\epsilon} n)$$

bits for all k simultaneously. Therefore, for any fixed $\sigma$ and k, we store s in
$n H_k(s) + o(n)$ bits. □
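The block-by-block strategy in this proof can be sketched in a few lines of
Python. This is only an illustration of the control flow, under a loud
assumption: `zlib.compress` stands in for Gupta, Grossi and Vitter's algorithm,
which we do not implement here, so the sketch carries none of the entropy
guarantees of Theorem 6.2. The function names and the block size parameter are
ours.

```python
import zlib


def compress_in_blocks(data: bytes, block_size: int):
    """Compress data one block at a time, holding a single block in memory."""
    compressed = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]   # read the next block
        compressed.append(zlib.compress(block))  # stand-in block compressor
    return compressed


def decompress_blocks(compressed):
    """Invert compress_in_blocks by decompressing and concatenating blocks."""
    return b"".join(zlib.decompress(block) for block in compressed)
```

In the proof the block size is $\lceil \log^{\epsilon} n \rceil$ characters; a practical block
compressor would use far larger blocks, which is the memory-versus-redundancy
tradeoff discussed in the text.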

A bound of $n H_k(s) + O(\sigma^k n \log\log n / \log^{\epsilon} n)$ bits is not very meaningful when k
is not fixed and grows as fast as $\log\log n$, because the second term is $\omega(n)$. Notice,
however, that Gupta et al.'s bound of $n H_k(s) + O(\sigma^k \log n)$ bits is also not very
meaningful when $k \ge \log n$, for the same reason. As we will see in Section 6.3, it is
possible for s to be fairly incompressible but still to have $H_k(s) = 0$ for $k \ge \log n$.
It follows that, although we can prove bounds that hold for all k simultaneously,
those bounds cannot guarantee good compression in terms of $H_k(s)$ when
$k \ge \log n$.

By using larger blocks (and, thus, more memory) we can reduce the
$O(\sigma^k n \log\log n / \log^{\epsilon} n)$ redundancy term in our analysis, allowing k to grow faster
than $\log\log n$ while still having a meaningful bound. We conjecture that the
resulting tradeoff is nearly optimal. Specifically, using an argument similar to
those we use to prove the lower bounds in Sections 6.2 and 6.3, we believe we can
prove that the product of the memory, passes and redundancy must be nearly
linear in n. It is not clear to us, however, whether we can modify Corollary 6.3 to
take advantage of multiple passes.

Open Problem 6.4. With multiple passes over one stream, can we achieve better 
bounds on the memory and redundancy than we can with one pass? 



6.2 Grammar-based compression

Charikar et al. [CLL+05] and Rytter [Ryt03] independently showed how to build
a context-free grammar APPROX that generates s and only s and is an $O(\log n)$
factor larger than the smallest such grammar OPT, which is $\Omega(\log n)$ bits in size.

Theorem 6.5 (Charikar et al., 2005; Rytter, 2003). In the RAM model, we can
approximate the smallest grammar with $|\mathrm{APPROX}| = O(|\mathrm{OPT}|^2)$ using $O(n)$ time.
In this section we prove that, if we use only one stream, then in general our
approximation must be superpolynomially larger than the smallest grammar. Our
idea is to show that periodic strings whose periods are asymptotically slightly
larger than the product of the memory and passes can be encoded as small
grammars but, in general, cannot be compressed well by algorithms that use only
one stream. Our argument is based on the following two lemmas.

Lemma 6.6. If s has period $\ell$, then the size of the smallest grammar for that
string is $O(\ell + \log n \log\log n)$ bits.

Proof. Let t be the repeated substring and t' be the proper prefix of t such that
$s = t^{\lfloor n/\ell \rfloor} t'$. We can encode a unary string $X^{\lfloor n/\ell \rfloor}$ as a grammar $G_1$ with $O(\log n)$
productions of total size $O(\log n \log\log n)$ bits. We can also encode t and t' as
grammars $G_2$ and $G_3$ with $O(\ell)$ productions of total size $O(\ell)$ bits. Suppose $S_1$,
$S_2$ and $S_3$ are the start symbols of $G_1$, $G_2$ and $G_3$, respectively. By combining
those grammars and adding the productions $S_0 \to S_1 S_3$ and $X \to S_2$, we obtain
a grammar with $O(\ell + \log n)$ productions of total size $O(\ell + \log n \log\log n)$ bits
that maps $S_0$ to s. □
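The construction in this proof can be made concrete. The Python sketch below is
ours and makes several assumptions for illustration: a grammar is a dictionary
from nonterminal names to lists of symbols, terminals are single characters,
every nonterminal name has at least two characters (so the proof's X is written
`X_`), and the unary string is generated by repeated doubling, which is one
natural way to obtain $O(\log n)$ productions.

```python
def periodic_grammar(t, n):
    """Grammar for the length-n string with repeated substring t (len(t) >= 1)."""
    q, rem = divmod(n, len(t))
    grammar = {
        "S2": list(t),        # G2: generates t
        "S3": list(t[:rem]),  # G3: generates the proper prefix t'
        "X_": ["S2"],         # the production X -> S2
        "P0": ["X_"],         # Pj generates X repeated 2**j times
    }
    for j in range(1, max(q.bit_length(), 1)):
        grammar[f"P{j}"] = [f"P{j - 1}", f"P{j - 1}"]
    # G1: S1 generates X repeated q times, via the binary expansion of q.
    grammar["S1"] = [f"P{j}" for j in range(q.bit_length()) if (q >> j) & 1]
    grammar["S0"] = ["S1", "S3"]
    return grammar


def expand(grammar, symbol="S0"):
    """The string generated from symbol; symbols not in the grammar are terminals."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(grammar, x) for x in grammar[symbol])
```

For example, `expand(periodic_grammar("abc", 11))` yields `"abcabcabcab"`, and
the number of productions grows only logarithmically with n for fixed t.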

Lemma 6.7. Consider a lossless compression algorithm that uses only one stream, 
and a machine performing that algorithm. We can compute any substring from 

• its length; 

• for each pass, the machine's memory configurations when it reaches and 
leaves the part of the stream that initially holds that substring; 

• all the output the machine produces while over that part. 

Proof. Let t be the substring and assume, for the sake of a contradiction, that 
there exists another substring t' with the same length that takes the machine 
between the same configurations while producing the same output. Then we can 
substitute t' for t in s without changing the machine's complete output, contrary 
to our specification that the compression be lossless. □ 



Lemma 6.7 implies that, for any substring, the size of the output the machine
produces while over the part of the stream that initially holds that substring, plus
twice the product of the memory and passes (i.e., the number of bits needed to
store the memory configurations), must be at least that substring's complexity.
Therefore, if a substring is not compressible by more than a constant factor (as
is the case for most strings) and asymptotically larger than the product of the
memory and passes, then the size of the output for that substring must be at least
proportional to the substring's length. In other words, the algorithm cannot take
full advantage of similarities between substrings to achieve better compression.
In particular, if s is periodic with a period that is asymptotically slightly larger
than the product of the memory and passes, and s's repeated substring is not
compressible by more than a constant factor, then the algorithm's complete output
must be $\Omega(n)$ bits. By Lemma 6.6, however, the size of the smallest grammar that
generates s and only s is bounded in terms of the period.

Theorem 6.8. With one stream, we cannot approximate the smallest grammar
with $|\mathrm{APPROX}| \le |\mathrm{OPT}|^{O(1)}$.

Proof. Suppose an algorithm uses only one stream, m bits of memory and p passes
to compress s, with $mp = \log^{O(1)} n$, and consider a machine performing that
algorithm. Furthermore, suppose s is periodic with period $\lceil mp \log n \rceil$ and its repeated
substring t is not compressible by more than a constant factor. Lemma 6.7 implies
that the machine's output while over a part of the stream that initially holds a
copy of t must be $\Omega(mp \log n - mp) = \Omega(mp \log n)$ bits. Therefore, the machine's
complete output must be $\Omega(n)$ bits. By Lemma 6.6, however, the size of the smallest
grammar that generates s and only s is $O(mp \log n + \log n \log\log n) \subset \log^{O(1)} n$
bits. Since $n = \log^{\omega(1)} n$, the algorithm's complete output is superpolynomially
larger than the smallest grammar. □

As an aside, we note that a symmetric argument shows that, with only one
stream, in general we cannot decode a string encoded as a small grammar. To
prove this, instead of considering a part of the stream that initially holds a copy of
the repeated substring t, we consider a part that is initially blank and eventually
holds a copy of t. We can compute t from the machine's memory configurations
when it reaches and leaves that part, so the product of the memory and passes
must be greater than or equal to t's complexity. Also, we note that Theorem 6.8
has the following corollary, which may be of independent interest.

Corollary 6.9. With one stream, we cannot find strings' minimum periods. 



Proof. Consider the proof of Theorem 6.8. If we could find s's minimum period,
then we could store s in $\log^{O(1)} n$ bits by writing n and one copy of its repeated
substring t. □
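For contrast with this one-stream impossibility result, finding the minimum
period is straightforward in the RAM model with linear memory. The Python
sketch below is ours, not part of the thesis's argument; it uses the standard
border (failure-function) computation from Knuth-Morris-Pratt string matching,
and the function name is an assumption.

```python
def minimum_period(s):
    """Smallest p >= 1 with s[i] == s[i + p] for all valid i (0 if s is empty)."""
    n = len(s)
    # border[i] = length of the longest proper prefix of s[:i] that is also
    # a suffix of s[:i].
    border = [0] * (n + 1)
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = border[k]
        if s[i] == s[k]:
            k += 1
        border[i + 1] = k
    return n - border[n]
```

Given the minimum period p, the string is determined by n and its first p
characters, which is the encoding used in the proof above.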
We are currently working on a more detailed argument to show that we cannot
even check whether a string has a given period. Unfortunately, as we noted earlier,
our results for this section are still incomplete, as we do not know whether multiple
streams are helpful for grammar-based compression.

Open Problem 6.10. With $O(1)$ streams, can we approximate the smallest
grammar well?

6.3 Entropy-only bounds 

Kosaraju and Manzini [KM99] pointed out that proving an algorithm universal
does not necessarily tell us much about how it behaves on low-entropy strings. In
other words, showing that an algorithm encodes s in $n H_k(s) + o(n)$ bits is not very
informative when $n H_k(s) = o(n)$. For example, although the well-known LZ78
compression algorithm [ZL78] is universal, $|\mathrm{LZ78}(1^n)| = \Omega(\sqrt{n})$ while $n H_0(1^n) = 0$.
To analyze how algorithms perform on low-entropy strings, we would like to get
rid of the $o(n)$ term and prove bounds that depend only on $n H_k(s)$. Unfortunately,
this is impossible since, as the example above shows, even $n H_0(s)$ can be 0 for
arbitrarily long strings.
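The behaviour of LZ78 on unary strings is easy to observe. The Python sketch
below is ours (a textbook LZ78 parse with an illustrative function name); it
returns the list of phrases, each given as the index of a previous phrase plus
one new character. On $1^n$ the phrases have lengths 1, 2, 3, and so on, so their
number grows like $\sqrt{2n}$ even though the 0th-order empirical entropy is 0.

```python
def lz78_phrases(s):
    """LZ78 parse of s as (index of longest previous phrase, next character)."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    phrases = []
    current = ""
    for ch in s:
        if current + ch in dictionary:
            current += ch  # keep extending the current match
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:  # leftover match at the end of the input
        phrases.append((dictionary[current], ""))
    return phrases
```

For example, `len(lz78_phrases("1" * 5050))` is 100, because
1 + 2 + ... + 100 = 5050.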

It is not hard to show that only unary strings have $H_0(s) = 0$. For $k \ge 1$,
recall that $H_k(s) = (1/n) \sum_{|w| = k} |w_s| H_0(w_s)$. Therefore, $H_k(s) = 0$ if and only
if each distinct k-tuple w in s is always followed by the same distinct character.
This is because, if a w is always followed by the same distinct character, then
$w_s$ is unary, $H_0(w_s) = 0$ and w contributes nothing to the sum in the formula.
Manzini [Man01] defined the kth-order modified empirical entropy $H^*_k(s)$ such that
each context w contributes at least $\lfloor \log |w_s| \rfloor + 1$ to the sum. Because modified
empirical entropy is more complicated than empirical entropy (e.g., it allows
for variable-length contexts), we refer readers to Manzini's paper for the full
definition. In our proofs in this chapter, we use only the fact that

$n H_k(s) \le n H^*_k(s) \le n H_k(s) + O(\sigma^k \log n)$ .

Manzini showed that, for some algorithms and all k simultaneously, it is possible
to bound the encoding's length in terms of only $n H^*_k(s)$ and a constant
that depends only on $\sigma$ and k; he called such bounds "entropy-only". In particular,
he showed that an algorithm based on the Burrows-Wheeler Transform
(BWT) [BW94] stores any string s in at most $(5 + \epsilon) n H^*_k(s) + \log n + g_k$ bits for
all k simultaneously (since $n H^*_k(s) \ge \log(n - k)$, we could remove the $\log n$ term
by adding 1 to the coefficient $5 + \epsilon$).

Theorem 6.11 (Manzini, 2001). Using the BWT, move-to-front coding, run-length
coding and arithmetic coding, we can achieve an entropy-only bound.
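Of the components named in Theorem 6.11, move-to-front coding is the simplest
to illustrate. The Python sketch below is ours and shows only the textbook
technique (the function names and the explicit `alphabet` argument are
assumptions); it does not reproduce the full pipeline analysed by Manzini.

```python
def mtf_encode(s, alphabet):
    """Replace each character by its position in a self-organising list."""
    table = list(alphabet)
    codes = []
    for ch in s:
        i = table.index(ch)
        codes.append(i)
        table.insert(0, table.pop(i))  # move the character to the front
    return codes


def mtf_decode(codes, alphabet):
    """Invert mtf_encode, maintaining the same list as the encoder."""
    table = list(alphabet)
    chars = []
    for i in codes:
        ch = table.pop(i)
        chars.append(ch)
        table.insert(0, ch)
    return "".join(chars)
```

Runs of equal characters, which are common in the output of the BWT, become
runs of zeros: `mtf_encode("aaabbb", "ab")` is `[0, 0, 0, 1, 0, 0]`.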

The BWT sorts the characters in a string into the lexicographical order of the
suffixes that immediately follow them. When using the BWT for compression, it
is customary to append a special character $ that is lexicographically less than
any in the alphabet. For a more thorough description of the BWT, we again refer
readers to Manzini's paper. In this section we first show how we can compute and
invert the BWT with two streams and, thus, achieve entropy-only bounds. We
then show that we cannot achieve entropy-only bounds with only one stream. In
other words, two streams are necessary and sufficient for us to achieve entropy-only
bounds.

One of the most common ways to compute the BWT is by building a suffix
array. In his PhD thesis, Ruhl introduced the StreamSort model [Ruh03, ADRR04],
which is similar to the read/write streams model with one stream, except that it
has an extra primitive that sorts the stream in one pass. Among other things, he
showed how to build a suffix array efficiently in this model.

Theorem 6.12 (Ruhl, 2003). In the StreamSort model, we can build a suffix array
using $O(\log n)$ bits of memory and $O(\log n)$ passes.

Corollary 6.13. With two streams, we can compute the BWT using $O(\log n)$ bits
of memory and $O(\log^2 n)$ passes.

Proof. We can compute the BWT in the StreamSort model by appending $ to s,
building a suffix array, and replacing each value i in the array by the (i - 1)st
character in s (replacing either 0 or 1 by $, depending on where we start counting).
This takes O(log n) bits of memory and O(log n) passes. Since we can sort with
two streams using O(log n) bits of memory and O(log n) passes (see, e.g., [Sch07]),
it follows that we can compute the BWT using O(log n) bits of memory and
O(log^2 n) passes. □
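The first step of this proof, computing the BWT from a suffix array, can be
sketched as follows. This Python sketch is ours and works in the RAM model with
a naive suffix sort, not in the StreamSort or read/write streams models; it
assumes the sentinel `$` does not occur in s and compares lower than every
character of s, and it counts positions from 0, so the value 0 is the one
replaced by `$`.

```python
def suffix_array(t):
    """Starting positions of the suffixes of t in lexicographic order (naive)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt(s, sentinel="$"):
    """Burrows-Wheeler Transform of s with the sentinel appended."""
    t = s + sentinel
    # Each suffix-array value i is replaced by the character preceding
    # position i; for i == 0, t[-1] is the sentinel.
    return "".join(t[i - 1] for i in suffix_array(t))
```

For example, `bwt("banana")` is `"annb$aa"`.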

Now suppose we are given a permutation $\pi$ on n + 1 elements as a list $\pi(1), \ldots,
\pi(n + 1)$, and asked to rank it, i.e., to compute the list $\pi^0(1), \ldots, \pi^n(1)$. This
problem is a special case of list ranking (see, e.g., [ABD+07]) and has a surprisingly
long history. For example, Knuth [Knu98, Solution 24] described an algorithm,
which he attributed to Hardy, for ranking a permutation with two tapes. More
recently, Bird and Mu [BM04] showed how to invert the BWT by ranking a
permutation. Therefore, reinterpreting Hardy's result in terms of the read/write
streams model gives us the following bounds.

Theorem 6.14 (Hardy, c. 1967). With two streams, we can rank a permutation
using $O(\log n)$ bits of memory and $O(\log^2 n)$ passes.

Corollary 6.15. With two streams, we can invert the BWT using $O(\log n)$ bits
of memory and $O(\log^2 n)$ passes.

Proof. The BWT has the property that, if a character is the ith in BWT(s), then
its successor in s is the lexicographically ith in BWT(s) (breaking ties by order of
appearance). Therefore, we can invert the BWT by replacing each character by
its lexicographic rank, ranking the resulting permutation, replacing each value i
by the ith character of BWT(s), and rotating the string until $ is at the end. This
takes O(log n) bits of memory and O(log^2 n) passes. □
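The inversion procedure in this proof can be sketched directly. The Python
sketch below is ours and runs in the RAM model rather than with two streams; it
assumes its input is the BWT of a string containing exactly one sentinel `$`,
which compares lower than every other character. Positions are counted from 0,
so ranking starts from position 0 rather than 1.

```python
def inverse_bwt(last, sentinel="$"):
    """Recover s from BWT(s) by ranking a permutation."""
    # pi[i] = position in `last` of the lexicographically i-th character,
    # breaking ties by order of appearance (Python's sort is stable).
    pi = sorted(range(len(last)), key=lambda i: last[i])
    # Rank the permutation: visit 0, pi[0], pi[pi[0]], ...
    order, i = [], 0
    for _ in range(len(last)):
        order.append(i)
        i = pi[i]
    # Replace each value by the corresponding character of the BWT.
    text = "".join(last[i] for i in order)
    # Rotate until the sentinel is at the end, then drop it.
    cut = text.index(sentinel)
    return text[cut + 1:] + text[:cut]
```

For example, `inverse_bwt("annb$aa")` is `"banana"`.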

Since we can compute and invert move-to-front, run-length and arithmetic coding
using $O(\log n)$ bits of memory and $O(1)$ passes over one stream, by combining
Theorem 6.11 and Corollaries 6.13 and 6.15 we obtain the following theorem.



Theorem 6.16. With two streams, we can achieve an entropy-only bound using
$O(\log n)$ bits of memory and $O(\log^2 n)$ passes.

To show we need at least two streams to achieve entropy-only bounds, we use
De Bruijn cycles in a proof similar to the one for Theorem 6.8. A kth-order De
Bruijn cycle [dB46] is a cyclic sequence in which every possible k-tuple appears
exactly once. For example, Figure 6.1 shows a 3rd-order and a 4th-order De Bruijn
cycle. (We need consider only binary De Bruijn cycles.) Our argument this time
is based on Lemma 6.7 and the following results about De Bruijn cycles.



Lemma 6.17. If $s \in d^*$ for some kth-order De Bruijn cycle d, then $n H^*_k(s) =
O(2^k \log n)$.

Proof. By definition, each distinct k-tuple is always followed by the same distinct
character; therefore, $n H_k(s) = 0$ and $n H^*_k(s) = O(2^k \log n)$. □
[Figure 6.1: Examples of 3rd-order and 4th-order De Bruijn cycles.]



Theorem 6.18 (De Bruijn, 1946). There are $2^{2^{k-1} - k}$ kth-order De Bruijn cycles.

Corollary 6.19. We cannot store most kth-order De Bruijn cycles in $o(2^k)$ bits.

Since there are $2^k$ possible k-tuples, kth-order De Bruijn cycles have length
$2^k$, so Corollary 6.19 means that we cannot compress most De Bruijn cycles by
more than a constant factor. Therefore, we can prove a lower bound similar to
Theorem 6.8 by supposing that s's repeated substring is a De Bruijn cycle, then
using Lemma 6.17 instead of Lemma 6.6.
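The properties of De Bruijn cycles used here can be checked on small examples.
The Python sketch below is ours: `de_bruijn` builds one particular binary
kth-order cycle with the standard Lyndon-word construction (it produces a
single, highly compressible cycle, not the incompressible ones the lower bound
needs), and the two helper functions, whose names are assumptions, check the
defining property and the fact behind Lemma 6.17.

```python
def de_bruijn(k):
    """One binary kth-order De Bruijn cycle, as a string of length 2**k."""
    a = [0] * (k + 1)
    seq = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, 2):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(map(str, seq))


def is_de_bruijn(d, k):
    """True if every binary k-tuple appears exactly once in the cycle d."""
    wrapped = d + d[:k - 1]
    tuples = {wrapped[i:i + k] for i in range(len(d))}
    return len(d) == 2 ** k and len(tuples) == 2 ** k


def has_unique_followers(s, k):
    """True if each k-tuple in s is always followed by the same character."""
    follower = {}
    for i in range(len(s) - k):
        if follower.setdefault(s[i:i + k], s[i + k]) != s[i + k]:
            return False
    return True
```

For a 4th-order cycle d, `has_unique_followers(d * 5, 4)` holds, so the
4th-order empirical entropy of the repeated string is 0, whereas
`has_unique_followers(d * 5, 3)` fails.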

Theorem 6.20. With one stream, we cannot achieve an entropy-only bound. 



Proof. As in the proof of Theorem 6.8, suppose an algorithm uses only one stream,
m bits of memory and p passes to compress s, with $mp = \log^{O(1)} n$, and consider
a machine performing that algorithm. This time, however, suppose s is periodic
with period $2^{\lceil \log(mp \log n) \rceil}$ and that its repeated substring t is a kth-order De Bruijn
cycle, $k = \lceil \log(mp \log n) \rceil$, that is not compressible by more than a constant factor.
Lemma 6.7 implies that the machine's output while over a part of the stream that
initially holds a copy of t must be $\Omega(mp \log n - mp) = \Omega(mp \log n)$ bits. Therefore,
the machine's complete output must be $\Omega(n)$ bits. By Lemma 6.17, however,
$n H^*_k(s) = O(2^k \log n) = O(mp \log^2 n) \subset \log^{O(1)} n$. □

Notice Theorem 6.20 implies a lower bound for computing the BWT: if we
could compute the BWT with one stream then, since we can compute move-to-front,
run-length and arithmetic coding using $O(\log n)$ bits of memory and $O(1)$
passes over one stream, we could thus achieve an entropy-only bound with one
stream, contradicting Theorem 6.20.



Corollary 6.21. With one stream, we cannot compute the BWT. 

Grohe and Schweikardt [GS05] proved that, with $O(1)$ streams, we generally
cannot sort $n / \log n$ numbers, each consisting of $\log n$ bits, using $O(n^{1 - \epsilon})$ bits of
memory and $o(\log n)$ passes. Combining this result with the following lemma, we
immediately obtain a lower bound for computing the BWT with $O(1)$ streams.
Lemma 6.22. With two or more streams, sorting $O(n / \log n)$ numbers, each of
$\log n$ bits, takes $O(\log n)$ more bits of memory and $O(1)$ more passes than
computing the BWT of a ternary string of length n.

Proof. We reduce the problem of sorting a sequence $x_1, \ldots, x_m$ of $(\log n)$-bit binary
numbers, $m = n / (2 \log n + \log\log n + 2)$, to the problem of computing the BWT of
a ternary string of length n. Let $x_i[j]$ denote the jth bit of $x_i$. Using two streams,
$O(1)$ passes and $O(\log n)$ bits of memory, we replace each $x_i[j]$ by $x_i[j] \; 2 \; x_i \; i \; j$, writing
2 as a single character, $x_i$ and i each as $\log n$ bits, and j as $\log\log n$ bits. Let
X be the resulting string and consider the last $m \log n$ characters of the BWT of
X: they are a permutation of the characters followed by 2s in X, i.e., the bits of
$x_1, \ldots, x_m$; if $x_i < x_{i'}$ or $x_i = x_{i'}$ but $i < i'$ then, because $2 \; x_i \; i$ is lexicographically
less than $2 \; x_{i'} \; i'$, each bit of $x_i$ comes before each bit of $x_{i'}$; if $j < j'$ then, for
any i, because $2 \; x_i \; i \; j$ is lexicographically less than $2 \; x_i \; i \; j'$, the bit $x_i[j]$ comes
before the bit $x_i[j']$. In other words, the last $m \log n$ characters of the BWT of X
are $x_1, \ldots, x_m$ in sorted order. □
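The reduction in this proof can be tried out on small inputs. The Python sketch
below is ours and departs from the proof in two labelled ways: it computes the
BWT naively in the RAM model with an appended sentinel `$` (which sorts before
0, 1 and 2, so the suffixes starting with 2 still come last), and it writes the
index i with just enough bits to distinguish the m numbers rather than with
$\log n$ bits. The function names and the `width` parameter are assumptions.

```python
def bwt(text):
    """Naive BWT of text with the sentinel '$' appended."""
    t = text + "$"
    order = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in order)


def sort_via_bwt(nums, width):
    """Sort non-negative integers below 2**width using only a BWT computation."""
    index_bits = max(1, (len(nums) - 1).bit_length())
    pos_bits = max(1, (width - 1).bit_length())
    pieces = []
    for i, x in enumerate(nums):
        x_bits = format(x, f"0{width}b")
        i_bits = format(i, f"0{index_bits}b")
        for j, bit in enumerate(x_bits):
            # Replace the bit x_i[j] by  x_i[j] 2 x_i i j.
            pieces.append(bit + "2" + x_bits + i_bits + format(j, f"0{pos_bits}b"))
    # The last len(nums) * width characters of the BWT are the characters
    # preceding the 2s, ordered by (x_i, i, j).
    tail = bwt("".join(pieces))[-len(nums) * width:] if nums else ""
    return [int(tail[p:p + width], 2) for p in range(0, len(tail), width)]
```

For example, `sort_via_bwt([5, 3, 7, 3, 0], 3)` returns `[0, 3, 3, 5, 7]`.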

Corollary 6.23. With $O(1)$ streams, we cannot compute the BWT of a ternary
string of length n using $O(n^{1 - \epsilon})$ bits of memory and $o(\log n)$ passes.

In another paper [GM07a] we improved the coefficient in Manzini's bound
from $5 + \epsilon$ to 2.7, using a variant of distance coding instead of move-to-front and
run-length coding. We conjecture this algorithm can also be implemented with
two streams.

Open Problem 6.24. With $O(1)$ streams, can we achieve the same entropy-only
bounds that we achieve in the RAM model?

The main idea of distance coding [Bin00] is to write the starting position of
each maximal run (i.e., subsequence consisting of copies of the same character),
by writing the distance from the start of each maximal run to the start of the next
maximal run of the same character. Notice we do not need to write the length
of each run because the end of each run (except the last) is the position before
the start of the next one. By symmetry, it makes essentially no difference to the
length of the encoding if we write the distance to the start of each maximal run
from the start of the previous maximal run of the same character, which is not
difficult with $O(\sigma \log n)$ bits of memory and $O(1)$ passes.
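The second variant described above, in which each maximal run is coded by its
distance from the previous maximal run of the same character, needs only one
stored position per character. The Python sketch below is ours and is a
simplified illustration, not a complete coder: it emits (character, distance)
pairs, using `None` for the first run of each character, and does not include
the decoder or the final integer and arithmetic coding stages.

```python
def distance_code(s):
    """For each maximal run in s: (character, distance back to the start of the
    previous maximal run of that character), or (character, None) if none."""
    last_start = {}  # one position per character
    out = []
    for p, ch in enumerate(s):
        if p > 0 and s[p - 1] == ch:
            continue  # still inside the current run
        previous = last_start.get(ch)
        out.append((ch, None if previous is None else p - previous))
        last_start[ch] = p
    return out
```

For example, `distance_code("aabbbaab")` is
`[("a", None), ("b", None), ("a", 5), ("b", 5)]`.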
Kaplan, Landau and Verbin [KLV07] showed how, using the BWT followed by
distance coding and arithmetic coding, we can store s in $1.73 \, n H_k(s) + O(\log n)$
bits for any fixed $\sigma$ and k. This bound holds only when we use an idealized
arithmetic coder with $O(\log n)$ total redundancy; if we use a 0th-order coder with
per-character redundancy $\mu$ then the bound becomes $1.73 \, n H_k(s) + \mu n + O(\log n)$.
In our paper we used a lemma due to Mäkinen and Navarro [MN05], bounding the
number of runs in terms of the product of the length and the 0th-order
empirical entropy, to change the latter bound into $(1.73 + \mu) n H_k(s) + O(\log n)$,
which is an improvement when $H_k(s) < 1$. Unfortunately, the presence of the
$O(\log n)$ term prevents this from being an entropy-only bound. To prove an
entropy-only bound, we modified distance coding to use an escape mechanism,
which we have not verified can be implemented in the read/write streams model.
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In this thesis we have tried to provide a fairly complete but coherent view of our 
studies of sequential-access data compression, balancing discussion of previous 
work with presentation of our own results. We would like to highlight now what we 
consider our key ideas. The most important innovation in Chapter [2] was probably 
our use of predecessor queries for encoding and decoding with a canonical code. 
This, combined with our use of Fredman and Willard's data structure |FW93] . 
Shannon coding and background processing, allowed us to encode and decode each 
character in constant worst-case time while producing an encoding whose length 
was worst-case optimal. Chapters [3] and |4] were, admittedly, somewhat tangential 
to our topic, but we included them to show how our interests shifted from the 
model we considered in Chapter [2] to the one we considered in Chapter [5j The 
key idea in Chapter [5] was to view one-pass algorithms with memory bounded 
in terms of the alphabet size and context length as finite-state machines. This, 
combined with the fact that short, randomly chosen strings almost certainly have 
low empirical entropy, allowed us to prove a lower bound on the amount of memory 
needed to achieve good compression that nearly matched our upper bound (which 
was relatively easy to prove, given Lemma 5.1). Finally, the key idea in Chapter [6] 
was to extend the automata-theoretic arguments of Chapter [5] to algorithms 
that can make multiple passes and use an amount of memory that depends on the 
length of the input. This gave us our lower bound for achieving good grammar- 
based compression with one stream, our lower bound for finding strings' minimum 
periods and, combined with properties of De Bruijn sequences, our lower bound 
for achieving entropy-only bounds. 
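
To make the first of these ideas concrete, the following is a minimal sketch of decoding a canonical code with predecessor queries. It is an illustration under stated assumptions, not the data structure of Chapter [2]: a sorted list searched with Python's `bisect` stands in for Fredman and Willard's structure (so each query takes logarithmic rather than constant time), the code is static rather than adaptive, and the names `build_canonical` and `decode` are hypothetical. The point is only that, once the first codeword of each length is left-justified in a window of W bits, the length of the next codeword is given by the predecessor of the next W input bits.

```python
from bisect import bisect_right

def build_canonical(lengths):
    """Build a canonical code from a dict symbol -> codeword length.

    The lengths are assumed to satisfy Kraft's inequality.
    """
    order = sorted(lengths, key=lambda c: (lengths[c], c))
    W = max(lengths.values())
    codes = {}
    starts = []  # left-justified first codeword of each distinct length
    info = []    # (length, rank in `order` of the first symbol of that length)
    code, prev_len = 0, 0
    for rank, sym in enumerate(order):
        l = lengths[sym]
        code <<= l - prev_len
        if l != prev_len:
            starts.append(code << (W - l))
            info.append((l, rank))
        codes[sym] = format(code, "0{}b".format(l))
        code += 1
        prev_len = l
    return order, W, codes, starts, info

def decode(bits, order, W, starts, info):
    """Decode a bit string, one predecessor query per character."""
    out, pos = [], 0
    while pos < len(bits):
        # Read the next W bits, padding with zeros at the end of the input.
        v = int(bits[pos:pos + W].ljust(W, "0"), 2)
        i = bisect_right(starts, v) - 1        # predecessor of v in starts
        l, first_rank = info[i]
        offset = (v - starts[i]) >> (W - l)    # position among codewords of length l
        out.append(order[first_rank + offset])
        pos += l
    return "".join(out)

order, W, codes, starts, info = build_canonical({"a": 1, "b": 2, "c": 3, "d": 3})
encoded = "".join(codes[c] for c in "abcdab")
print(codes, encoded, decode(encoded, order, W, starts, info))
```

Encoding can be handled symmetrically, by a predecessor query on a character's rank in the canonical order to find its codeword length; the sketch uses a plain dictionary for that direction.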

As we mentioned in the introduction, a paper [GKN09] we wrote with Marek 
Karpinski and Yakov Nekrich at the University of Bonn, which partially combines 
the results in Chapters [2] and [5], will appear at the 2009 Data Compression 
Conference. This paper concerns fast adaptive prefix coding with memory bounded 
in terms of the alphabet size and context length, and shows that we can encode 
$s$ in $(\lambda H + O(1)) n + o(n)$ bits while using $O(\sigma^{1/\lambda + \epsilon})$ bits of memory and 
$O(\log \log \sigma)$ worst-case time to encode and decode each character. Of course, we 
would like to improve these bounds, and perhaps implement and test how our 
algorithm performs with large alphabets such as Chinese, Unicode or the English 
vocabulary. We would also like to implement our algorithm from Chapter [2], testing 
several implementations of dictionaries to determine which is the fastest in 
practice; Fredman and Willard's analysis has enormous constants hidden in the 
asymptotic notation. Finally, we are preparing a paper with Nekrich that will give 
efficient algorithms for adaptive alphabetic prefix coding, adaptive prefix coding 
for unequal letter costs, and adaptive length-restricted prefix coding (see [Gag07a] 
for descriptions of these problems). 

As we also mentioned in the introduction, we are currently investigating whether 
we can prove any more results like the lower bound in Chapter [6] on finding strings' 
minimum periods. We are working on the open problems presented in Chapter [6] 
about using multiple passes to obtain smaller redundancy terms for universal 
compression with one stream, approximating the smallest grammar with $O(1)$ 
streams, and achieving better entropy-only bounds with $O(1)$ streams. Finally, 
we have been collaborating with Paolo Ferragina at the University of Pisa and 
Giovanni Manzini at the University of Eastern Piedmont on a paper [FGM] about 
BWT-based compression in the external memory model (see [Vit08]) with limited 
random disk accesses. For the moment, however, our curiosity about sequential- 
access data compression is mostly satisfied. 



After proving our first results about adaptive prefix coding [Gag07a], we wrote 
several papers [Gag06a, Gag06b, Gag08, Gaga] concerning the number of bits 
needed to store a good approximation of a probability distribution and, more 
generally, a Markov process. For one of these papers [Gag06b], about bounds 
on the redundancy in terms of the alphabet size and context length, we proved 
versions of Lemmas 5.1 and 5.7, which eventually led to Chapters [5] and [6]. We are 
now curious whether our results can be combined with algorithms that build 
sophisticated probabilistic models, either for data compression (see, e.g., 
[FGMS05, FGM06, FM08]) or for inference (see, e.g., [RM iRSTMj TOIOt IBYOT] and 
subsequent articles). These algorithms work by considering a class of probabilistic 
models that are, essentially, Markov sources with variable-length contexts, and 
finding the model that minimizes the sum of the length of the model's description 
and the self-information of the input with respect to the model; we note this sum 
is something like the kth-order modified empirical entropy. Similar kinds of models 
are used in both applications because many algorithms for inference are based 
on Rissanen's Minimum Description Length Principle [Ris78], which is based on 
ideas from data compression. 

How we minimize the sum of the length of the model's description and the 
self-information depends on how we represent the model. At least some of the 
algorithms mentioned above assume that the length of the description is proportional 
to the number of contexts used. However, it seems that, if some contexts occur 
frequently but the distributions of characters that follow them are nearly uniform, 
and others occur rarely but are always followed by the same character, then it might 
give better compression to prune the former and keep the latter, which take only 
$O(\log \sigma)$ bits each to store. Of course, this is just speculation at the moment. 
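
The kind of tradeoff we have in mind can be sketched numerically. The toy example below is only an illustration under assumptions of our own choosing, not a result: the cost model in `description_bits` is hypothetical (a context always followed by one character is charged about log σ bits, and any other context is charged σ counters), pruned contexts are coded with a uniform distribution, and we ignore the cost of recording which contexts are kept. All function names and the example counts are invented for the illustration.

```python
import math
from collections import Counter

def self_information(counts):
    """Bits to code the characters that follow one context, using their
    empirical distribution in that context."""
    n = sum(counts.values())
    return sum(c * math.log2(n / c) for c in counts.values())

def description_bits(counts, sigma):
    """Hypothetical cost, in bits, of storing one context's distribution."""
    if len(counts) == 1:
        # Always followed by the same character: store just that character.
        return math.ceil(math.log2(sigma))
    n = sum(counts.values())
    # Otherwise store one counter per character of the alphabet.
    return sigma * math.ceil(math.log2(n + 1))

def total_cost(contexts, keep, sigma):
    """Description length of the kept contexts plus the self-information
    of the input; pruned contexts fall back to a uniform distribution."""
    cost = 0.0
    for name, counts in contexts.items():
        if name in keep:
            cost += description_bits(counts, sigma) + self_information(counts)
        else:
            cost += sum(counts.values()) * math.log2(sigma)
    return cost

sigma = 4
contexts = {
    "frequent": Counter(a=26, b=25, c=25, d=24),  # nearly uniform
    "rare": Counter(c=3),                         # always followed by c
}
for keep in (set(), {"frequent"}, {"rare"}, {"frequent", "rare"}):
    print(sorted(keep), round(total_cost(contexts, keep, sigma), 2))
```

Under these particular assumptions, keeping only the rare, deterministic context gives the smallest total, which is the behaviour speculated about above; a different cost model could of course change the outcome.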